...

Scenario: Due to unexpected use of disk space on Mozart, one or more partitions fill up, and Mozart processes hang and stop functioning. This can happen when Mozart holds so many jobs that Elasticsearch outgrows the partition its documents are stored on, or when logfiles grow too large. In these cases, even though the Figaro interface shows jobs running, no jobs and no workers are actually active. Worse, this also affects whatever process decrements the ASGs when harikiri is called, so idle instances are left on AWS doing zero work.
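One way to catch this condition early is to check partition usage on Mozart before processes hang. The following is a minimal sketch using only the Python standard library. The function name, the 90% threshold, and the path checked are assumptions for illustration, not part of HySDS; substitute the mount points that hold the Elasticsearch data and the logfiles on your Mozart host.

```python
import shutil


def partitions_over_threshold(paths, threshold=0.9):
    """Return {path: fraction_used} for paths whose partition is at or above threshold.

    `paths` and `threshold` are illustrative; pick values that match your host.
    """
    full = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        fraction = usage.used / usage.total
        if fraction >= threshold:
            full[path] = fraction
    return full


if __name__ == "__main__":
    # "/" is a placeholder; list the real data and log mount points here.
    for path, fraction in partitions_over_threshold(["/"]).items():
        print(f"{path}: {fraction:.0%} used")
```

Run periodically (for example from cron), a check like this could warn before a partition fills completely.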

When you have partially shut down, idle instances in AWS

Scenario: We used to have a problem where, due to a paging bug in how HySDS got its list of ASGs, the process that decrements the number of desired instances in an ASG did not work properly. An ASG would scale up, but even after a node shut down, the ASG stayed at its original desired capacity. This means we were paying for instances when there were no longer jobs in the queue, and we were also unable to run our jobs effectively because one or more of the instances in the ASG were in an incompletely shut down state and could not service jobs.
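To illustrate the kind of paging issue described above: the AWS Auto Scaling `DescribeAutoScalingGroups` call returns results in pages, with a `NextToken` in the response when more pages remain. Code that reads only the first page never sees the remaining ASGs. The sketch below is not the HySDS implementation; the function names are hypothetical, and it assumes a client object shaped like boto3's `autoscaling` client. It follows `NextToken` until it is exhausted, then summarizes desired capacity against in-service instances so an operator can compare them with the job queue.

```python
def list_all_asgs(client):
    """Collect every ASG by following NextToken until no further page is returned."""
    groups = []
    kwargs = {}
    while True:
        page = client.describe_auto_scaling_groups(**kwargs)
        groups.extend(page.get("AutoScalingGroups", []))
        token = page.get("NextToken")
        if not token:
            return groups
        kwargs["NextToken"] = token  # request the next page


def summarize_capacity(groups):
    """Return (name, desired_capacity, in_service_count) for each ASG."""
    summary = []
    for group in groups:
        in_service = sum(
            1
            for instance in group.get("Instances", [])
            if instance.get("LifecycleState") == "InService"
        )
        summary.append(
            (group["AutoScalingGroupName"], group["DesiredCapacity"], in_service)
        )
    return summary
```

Assuming boto3 is installed and credentials are configured, `list_all_asgs(boto3.client("autoscaling"))` would return the full list. An ASG whose desired capacity stays above zero while its queue is empty is a candidate for manual inspection.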

...