Operational Pain Points
Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.
This page records general operational issues we have seen while operating HySDS for real projects.
When logs exist on a PCM instance, but nobody is paying attention
Scenario: We use cron jobs on factotum to submit jobs to HySDS that need to run periodically, but we don’t check the cron logs. There have been months’ worth of errors sitting in the cron log files that no one was checking.
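One low-effort mitigation is a second cron entry that scans those logs and alerts when errors appear, so failures surface without anyone having to remember to look. The sketch below is illustrative only; the log location, alert addresses, and local MTA are assumptions, not part of HySDS.

```python
#!/usr/bin/env python
"""Minimal sketch of a cron-log watchdog for factotum (hypothetical paths/addresses)."""
import glob
import smtplib
from email.message import EmailMessage

LOG_GLOB = "/data/work/cron_logs/*.log"   # hypothetical: wherever cron output is redirected
ALERT_TO = "ops-team@example.com"         # hypothetical operator alias


def find_errors():
    """Collect lines that look like failures from all cron log files."""
    hits = []
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if "ERROR" in line or "Traceback" in line:
                    hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits


def main():
    hits = find_errors()
    if not hits:
        return
    msg = EmailMessage()
    msg["Subject"] = f"[factotum] {len(hits)} error lines found in cron logs"
    msg["From"] = "factotum-watchdog@example.com"  # hypothetical sender
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(hits[:200]))         # cap the body size
    with smtplib.SMTP("localhost") as s:           # assumes a local MTA is available
        s.send_message(msg)


if __name__ == "__main__":
    main()
```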
When you run out of disk space on a PCM instance
Scenario: Due to unexpected use of disk space on Mozart, one or more partitions fill up to the point that Mozart processes hang and stop functioning. This can happen when there are so many jobs in Mozart that Elasticsearch outgrows the partition its documents are stored on, or simply because of log files. In these cases, even though the Figaro interface shows jobs running, no jobs are actually active and no workers are active. Worse, this also breaks whatever process decrements the ASGs when harikiri is called, meaning we have idle instances on AWS doing zero work.
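A cheap safety net is a cron’d disk check on Mozart that warns before a partition fills completely (longer term, pruning old Elasticsearch documents and rotating logs is the real fix). In the sketch below, the mount points and the 85% threshold are assumptions to adjust for your deployment.

```python
#!/usr/bin/env python
"""Minimal sketch of a partition watchdog for Mozart (hypothetical mount points)."""
import shutil
import sys

PARTITIONS = ["/", "/data", "/var/log"]   # hypothetical: where ES data and logs live
THRESHOLD = 0.85                          # warn above 85% used


def used_fraction(path):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def main():
    exit_code = 0
    for path in PARTITIONS:
        frac = used_fraction(path)
        if frac >= THRESHOLD:
            print(f"WARNING: {path} is {frac:.0%} full", file=sys.stderr)
            exit_code = 1
    sys.exit(exit_code)  # non-zero exit lets a wrapper or monitor raise an alert


if __name__ == "__main__":
    main()
```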
When you have partially shut-down, idle instances in AWS
Scenario: We used to have a problem where, due to a paging bug in how HySDS retrieved its list of ASGs, the process that decrements the desired number of instances in an ASG did not work properly. An ASG would scale up, but even after a node shut itself down, the ASG stayed at the original desired capacity. This means we are paying for instances when there are no longer jobs in the queue, and we also cannot run our jobs effectively because one or more instances in the ASG are in an incompletely shut-down state and unable to service jobs.
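For reference, the robust pattern is to page through all ASG results before deciding which group to decrement. The boto3 sketch below is not the HySDS code in question; it only illustrates correct pagination plus a desired-capacity decrement, with the region as a placeholder.

```python
"""Sketch of paginated ASG listing and a safe desired-capacity decrement with boto3."""
import boto3


def list_all_asgs(region="us-west-2"):  # region is a placeholder
    client = boto3.client("autoscaling", region_name=region)
    paginator = client.get_paginator("describe_auto_scaling_groups")
    asgs = []
    for page in paginator.paginate():          # iterate every page, not just the first
        asgs.extend(page["AutoScalingGroups"])
    return asgs


def decrement_desired(asg_name, region="us-west-2"):
    client = boto3.client("autoscaling", region_name=region)
    asg = next(
        (a for a in list_all_asgs(region) if a["AutoScalingGroupName"] == asg_name),
        None,
    )
    if asg is None:
        raise ValueError(f"ASG {asg_name} not found in any page of results")
    # Never go below the ASG's configured minimum size.
    new_desired = max(asg["MinSize"], asg["DesiredCapacity"] - 1)
    client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=new_desired,
        HonorCooldown=False,
    )
```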
When your job configuration is messed up such that the worker queue keeps being reset to an undesired queue every time you edit the trigger rule for the associated action.
Scenario: I have an issue where the default queue is wrong for an action I have a trigger rule for. I set the queue to my desired queue when I submit the trigger rule. However, every time I edit the trigger rule, the queue is reset to the default queue (which is invalid), and my jobs are either sent to an inappropriate node, where they fail, or they sit in RabbitMQ because no workers are listening to the invalid queue.
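A useful operational check for this failure mode is to look for queues that hold messages but have no consumers. The sketch below queries the RabbitMQ management API; the host, port, and credentials are placeholders for your Mozart deployment.

```python
"""Sketch that flags RabbitMQ queues with backlogged messages but no consumers."""
import requests

RABBITMQ_API = "http://mozart.example.com:15672/api/queues"  # hypothetical host
AUTH = ("guest", "guest")                                     # hypothetical credentials


def orphaned_queues():
    resp = requests.get(RABBITMQ_API, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return [
        q["name"]
        for q in resp.json()
        if q.get("messages", 0) > 0 and q.get("consumers", 0) == 0
    ]


if __name__ == "__main__":
    for name in orphaned_queues():
        print(f"Queue {name} has backlogged messages but no consumers")
```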
When you have a backlog of 10,000 jobs, of which 9,999 are unnecessary due to the idempotent nature of the job.
Scenario: A backlog of idempotent jobs builds up in the system after one of the workers either takes an excessively long time to run or simply hangs until an operations engineer intervenes. The job is designed to run periodically, say every hour; it is not designed to run in quick succession, since nothing will have changed between what job #1 in the queue accomplished and what job #2 accomplished. When there is a backlog of this type of job, there is no point in running 99% of them, and they should be skipped to avoid unnecessary work. A sketch of one way to short-circuit such jobs follows.
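One general mitigation, independent of any dedup features HySDS itself may provide, is for the periodic job to short-circuit when an identical run has already completed recently. The sketch below illustrates that pattern with a hypothetical marker index in Elasticsearch; the endpoint, index name, field names, and freshness window are all assumptions.

```python
"""Sketch of a self-deduplicating periodic job using a hypothetical Elasticsearch marker index."""
import hashlib
import json
from datetime import datetime, timedelta, timezone

import requests

ES_URL = "http://mozart.example.com:9200"    # hypothetical Elasticsearch endpoint
MARKER_INDEX = "periodic_job_markers"        # hypothetical index recording completed runs
FRESHNESS = timedelta(hours=1)               # job is scheduled hourly


def job_key(params):
    """Stable key for 'the same work': hash of the sorted job parameters."""
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()


def already_done(key):
    """True if a run with this key completed within the freshness window."""
    cutoff = (datetime.now(timezone.utc) - FRESHNESS).isoformat()
    query = {
        "query": {"bool": {"filter": [
            {"match": {"key": key}},
            {"range": {"completed_at": {"gte": cutoff}}},
        ]}},
        "size": 0,
    }
    resp = requests.post(f"{ES_URL}/{MARKER_INDEX}/_search", json=query, timeout=30)
    if resp.status_code == 404:   # marker index not created yet
        return False
    resp.raise_for_status()
    return resp.json()["hits"]["total"]["value"] > 0


def mark_done(key):
    doc = {"key": key, "completed_at": datetime.now(timezone.utc).isoformat()}
    requests.post(f"{ES_URL}/{MARKER_INDEX}/_doc", json=doc, timeout=30).raise_for_status()


def main(params):
    key = job_key(params)
    if already_done(key):
        print("identical job completed recently; skipping")
        return
    # ... do the real periodic work here ...
    mark_done(key)


if __name__ == "__main__":
    main({"job_type": "hourly-sync"})   # hypothetical parameters
```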