Mozart Watchdog Notifications

Hook Hua edited this page on Aug 28, 2018 · 7 revisions


Confidence Level TBD  This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

 

See Watchdogs for implementation notes.

Separate from Mozart, HySDS employs cron scripts that check Mozart periodically and send slackbot notifications to the team when specific conditions are detected. Watchdog slackbot notifications are sent to a Slack team channel for the following conditions:

(1) Processing is alive

Monitors for job types that have not successfully completed within the timeout period.

Slackbot notification

  • Job Status checking for job type

    • The last successfully completed job of this type was N hours ago.

    • The last failed job of this type was N hours ago.

Suggested Mitigation Plan for this alert

  • (to be filled out)

  • [closed] update watchdog script slackbot to show last few failed jobs instead of last successful job #498

  • [closed] add watchdog script to monitor for mozart jobs that have not successfully completed in timeout period #491
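The "processing is alive" check above can be sketched as a minimal illustration, not the actual watchdog script: the 24-hour threshold, the function name, and the idea that the last-completed timestamp comes from a Mozart job-status query (with the alert going to Slack) are all assumptions here.

```python
from datetime import datetime, timedelta

# Assumed alert threshold; the real watchdog's timeout would be configurable.
TIMEOUT = timedelta(hours=24)

def check_job_type(job_type, last_completed, now):
    """Return a slackbot-style alert message if the last successfully
    completed job of `job_type` is older than TIMEOUT, else None.
    `last_completed` would come from a Mozart job-status query."""
    age = now - last_completed
    if age > TIMEOUT:
        hours = age.total_seconds() / 3600
        return ("Job Status checking for job type %s: the last successfully "
                "completed job was %.1f hours ago." % (job_type, hours))
    return None
```

In production the returned message would be posted to the team Slack channel rather than returned to the caller.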

(2) Processing job timeouts

Monitors for job-started and job-offline timeouts.

Slackbot notification

  • Queue Status Alert:

    • Alert: Possible Job hanging in the queue

Suggested Mitigation Plan for this alert

  • (to be filled out)

  • [closed] add watchdog script to monitor for job-started and job-offline timeouts #232, which tagged resource=job with tag=timedout

  • TODO: Need to change to variant that emits slackbot notifications.

  • TODO: Check for stale job-started and job-offline states, and send a slackbot notification:

    • For each job in the job-started or job-offline state, look up that job’s soft time limit.

    • Check if now > (time started + soft time limit + margin), i.e. whether the time limit plus some margin (e.g. 5 minutes) has elapsed.

    • If it has, the job should have been killed by celery/verdi by now. If it was not, the job as shown in Mozart will be stale and stuck in the job-started state. Send notifications for these jobs.
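The staleness test described in the TODO above reduces to a single comparison. A minimal sketch, where the 5-minute margin follows the example in the text and the function name is hypothetical:

```python
from datetime import datetime, timedelta

MARGIN = timedelta(minutes=5)  # grace margin beyond the soft time limit

def is_stale(time_started, soft_time_limit_secs, now):
    """True if the job's soft time limit plus margin has elapsed.
    Such a job should already have been killed by celery/verdi, so if it
    still shows job-started/job-offline in Mozart it is stale."""
    deadline = time_started + timedelta(seconds=soft_time_limit_secs) + MARGIN
    return now > deadline
```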

(3) Processing task timeouts

Monitors for celery task-started timeouts.

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

  • [closed] add watchdog script to monitor for task-started timeouts #231

  • TODO: Similar to checking for stale job-started timeouts

(4) Worker is alive

Monitors for worker-heartbeat timeouts.

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

  • [open] add watchdog script to monitor for worker-heartbeat timeouts #230

  • TODO: Change to emit slackbot notification instead of tagging resource=worker in ES
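A sketch of what the slackbot-emitting variant might check. The 10-minute threshold, the function name, and the data shape are assumptions; the real script would consume celery worker-heartbeat events.

```python
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=10)  # assumed threshold

def dead_workers(heartbeats, now):
    """heartbeats: {worker_name: last_heartbeat_datetime}.
    Return workers whose last heartbeat has timed out; each would trigger
    a slackbot notification instead of tagging resource=worker in ES."""
    return sorted(w for w, ts in heartbeats.items()
                  if now - ts > HEARTBEAT_TIMEOUT)
```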

(5) Queue churn

Monitors whether active workers are draining each queue. Checks that workers exist to drain each non-empty queue.

Slackbot notification

  • Queue Status Alert:

    • Alert: No job running though jobs are waiting in the queue!!

Suggested Mitigation Plan for this alert

  • This alert may resolve itself if the auto-scaling group (ASG) kicks in after the watchdog check was last done.
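The churn condition itself (jobs waiting with no workers draining the queue) can be sketched as a pure check. The per-queue counts would come from, e.g., the RabbitMQ management API; that source, the function name, and the data shape are assumptions.

```python
def churned_queues(queue_stats):
    """queue_stats: {queue_name: (messages_ready, consumer_count)}.
    Return non-empty queues with no active consumers, i.e. the
    'No job running though jobs are waiting in the queue' condition."""
    return sorted(q for q, (ready, consumers) in queue_stats.items()
                  if ready > 0 and consumers == 0)
```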

 

 


Related Articles:

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise

Page Information:

Contribution History:

Subject Matter Expert:

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Hook Hua
