Mozart Watchdog Notifications
Hook Hua edited this page on Aug 28, 2018 · 7 revisions
Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.
See Watchdogs for implementation notes.
Separate from Mozart itself, HySDS runs cron scripts that check Mozart periodically and send slackbot notifications to the team when specific conditions are detected. Watchdog slackbot notifications are sent to a Slack team channel for the following conditions:
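These cron-driven checks report to Slack. A minimal sketch of posting a notification via a Slack incoming webhook; the webhook URL, channel name, and bot username below are placeholders for illustration, not values from this deployment:

```python
import json
from urllib.request import Request, urlopen


def build_slack_payload(channel, text, username="mozart-watchdog"):
    """Build the JSON body for a Slack incoming-webhook POST."""
    return {"channel": channel, "username": username, "text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (network call)."""
    req = Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # The channel and message are examples only.
    payload = build_slack_payload("#hysds-alerts", "Queue Status Alert: ...")
    print(json.dumps(payload))
```

Each watchdog below would format its finding as the `text` field and call `post_to_slack` with the team's configured webhook URL.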
(1) Processing is alive
Monitors for job types that have not successfully completed within a timeout period.
Slackbot notification
Job Status checking for job type
The last successfully completed job of type was N-hours ago.
The last failed job of type was N-hours ago.
Example: NoClobberException: Destination, s3://s3-us-west-2.amazonaws.com:80/....dataset.json, already exists and no-clobber is set.
Example: SoftTimeLimitExceeded: SoftTimeLimitExceeded()
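This check amounts to finding the most recent job of each type and measuring how long ago it completed. A minimal sketch against Mozart's Elasticsearch job index; the field names (`type`, `status`, `@timestamp`) are assumptions about the index mapping and should be verified against your deployment:

```python
from datetime import datetime, timezone


def last_job_query(job_type, status="job-completed", size=1):
    """Elasticsearch query body for the most recent job of a given
    type and status, sorted newest-first. Field names are assumed."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"type": job_type}},
                    {"term": {"status": status}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }


def hours_since(timestamp, now=None):
    """Hours elapsed since an ISO-8601 UTC timestamp like
    '2018-08-28T06:00:00Z'."""
    now = now or datetime.now(timezone.utc)
    then = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return (now - then).total_seconds() / 3600.0


def is_overdue(timestamp, timeout_hours, now=None):
    """True if the last completed job is older than the timeout period,
    i.e. the 'last successful job of type was N hours ago' alert fires."""
    return hours_since(timestamp, now) > timeout_hours
```

The same query with `status="job-failed"` covers the "last failed job of type" notification.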
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] update watchdog script slackbot to show last few failed jobs instead of last successful job #498
[closed] add watchdog script to monitor for mozart jobs that have not successfully completed in timeout period #491
(2) Processing job timeouts
Monitors for job-started and job-offline timeouts.
Slackbot notification
Queue Status Alert:
Alert: Possible Job hanging in the queue
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] add watchdog script to monitor for job-started and job-offline timeouts #232, which tagged resource=job with tag=timedout
TODO: Need to change to variant that emits slackbot notifications.
TODO: check for stale job-started and job-offline jobs, and send a slackbot notification. For each job in the job-started or job-offline state:
- Look up that job's soft time limit.
- Check whether now > (time started + soft time limit + margin), i.e. whether the time limit plus some margin (e.g. 5 minutes) has elapsed.
- If it has, the job should already have been killed by celery/verdi. If it was not, the job as shown in Mozart is stale and stuck in the job-started state. Send notifications for these jobs.
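The staleness test described in the TODO can be sketched as a pure function. The job field names (`id`, `time_start`, `soft_time_limit`) are assumptions for illustration; fetching the candidate jobs from Mozart is left to the caller:

```python
from datetime import datetime, timedelta, timezone


def is_stale(time_started, soft_time_limit_secs, margin_secs=300, now=None):
    """A job in job-started/job-offline state is stale when
    now > time_started + soft_time_limit + margin: celery/verdi
    should have killed it by then."""
    now = now or datetime.now(timezone.utc)
    started = datetime.fromisoformat(time_started.replace("Z", "+00:00"))
    deadline = started + timedelta(seconds=soft_time_limit_secs + margin_secs)
    return now > deadline


def find_stale_jobs(jobs, now=None):
    """Filter jobs (dicts with assumed 'id', 'time_start', and
    'soft_time_limit' fields) down to the stale ones to notify about."""
    return [
        j for j in jobs
        if is_stale(j["time_start"], j["soft_time_limit"], now=now)
    ]
```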
(3) Processing task timeouts
Monitors for celery task-started timeouts.
Slackbot notification
(to be filled out)
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] add watchdog script to monitor for task-started timeouts #231
TODO: Similar to the check for stale job-started timeouts in (2).
(4) Worker is alive
Monitors for worker-heartbeat timeouts.
Slackbot notification
(to be filled out)
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[open] add watchdog script to monitor for worker-heartbeat timeouts #230
TODO: Change to emit a slackbot notification instead of tagging resource=worker in ES.
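A heartbeat-timeout check reduces to comparing each worker's last heartbeat timestamp against a threshold. A minimal sketch, assuming the caller has already collected last-heartbeat times per worker (e.g. from celery's event stream); the worker names and timestamp format are illustrative:

```python
from datetime import datetime, timezone


def timed_out_workers(last_heartbeats, timeout_secs, now=None):
    """Given a mapping of worker name -> last heartbeat time
    (ISO-8601 UTC string), return the workers whose heartbeat is
    older than timeout_secs, sorted for stable notification text."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for worker, ts in last_heartbeats.items():
        then = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if (now - then).total_seconds() > timeout_secs:
            stale.append(worker)
    return sorted(stale)
```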
(5) Queue churn
Monitors whether active workers are draining each queue, i.e. checks that workers exist to drain each non-empty queue.
Slackbot notification
Queue Status Alert:
Alert: No job running though jobs are waiting in the queue!!
Suggested Mitigation Plan for this alert
This condition may resolve itself if autoscaling (ASG) kicks in after the watchdog last checked.
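The queue-churn check itself is a simple filter over per-queue statistics. A minimal sketch, assuming queue stats shaped like the RabbitMQ management API's `/api/queues` response (`name`, `messages`, `consumers` fields); fetching those stats is left to the caller:

```python
def churned_queues(queues):
    """Return the names of queues that hold messages but have no
    consumers draining them: the 'jobs waiting but no job running'
    condition this watchdog alerts on."""
    return [
        q["name"] for q in queues
        if q.get("messages", 0) > 0 and q.get("consumers", 0) == 0
    ]
```

Any non-empty result would be formatted into the Queue Status Alert message above.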
Related tickets
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua