Hook Hua edited this page on Aug 28, 2018 · 7 revisions


Confidence Level: TBD. This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

See Watchdogs for implementation notes.

Separate from Mozart, HySDS employs cron scripts that check Mozart periodically and send slackbot notifications to the team when specific conditions are detected. Watchdog slackbot notifications are sent to a Slack team channel for the following conditions:
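The notification side of such a cron script can be sketched as a small helper that posts to a Slack incoming webhook. This is a minimal sketch under assumptions: the webhook URL, function names, and message format here are illustrative, not the actual watchdog implementation.

```python
import json
import urllib.request

# Hypothetical webhook URL -- a real deployment would configure its own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_alert(condition, detail):
    """Build the Slack message payload for a watchdog condition."""
    return {"text": f":warning: Watchdog alert ({condition}): {detail}"}

def send_alert(condition, detail):
    """POST the alert payload to the Slack incoming webhook."""
    payload = json.dumps(build_alert(condition, detail)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

A cron entry would then invoke the script on a fixed interval, calling send_alert() only when a condition below is detected.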

(1) Processing is alive

Monitors for job types that have not successfully completed within a timeout period.

Slackbot notification

  • Job status checking for job type

    • The last successfully completed job of this type was N hours ago.

    • The last failed job of this type was N hours ago.
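The decision logic for this check can be sketched as a pure helper that, given the timestamp of the last successful job of a type, decides whether an alert should fire. This is a minimal sketch; the function name and timeout parameter are assumptions, not the actual watchdog code.

```python
from datetime import datetime, timedelta, timezone

def job_type_is_stale(last_success, timeout_hours, now=None):
    """Return True if the last successful job of a type completed
    longer than timeout_hours ago (i.e. an alert should fire)."""
    now = now or datetime.now(timezone.utc)
    return now - last_success > timedelta(hours=timeout_hours)

# Example: last success 25 hours ago against a 24-hour timeout.
now = datetime(2018, 8, 28, 12, 0, tzinfo=timezone.utc)
job_type_is_stale(now - timedelta(hours=25), 24, now=now)  # -> True
```

The watchdog would query Mozart for the most recent job-completed timestamp per job type and feed it to a check like this one.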

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] update watchdog script slackbot to show last few failed jobs instead of last successful job #498

  • [closed] add watchdog script to monitor for mozart jobs that have not successfully completed in timeout period #491

(2) Processing job timeouts

Monitors for job-started and job-offline timeouts.

Slackbot notification

  • Queue Status Alert:

    • Alert: Possible Job hanging in the queue

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] add watchdog script to monitor for job-started and job-offline timeouts #232, which tagged resource=job with tag=timedout

  • TODO: change to a variant that emits slackbot notifications.

  • TODO: check for stale job-started and job-offline jobs, and send slackbot notifications. For each job in the job-started or job-offline state:

    • Look up that job's soft time limit.

    • Check whether now > (time started + soft time limit + margin), i.e. whether the soft time limit plus some margin (e.g. 5 minutes) has elapsed.

    • If it has, the job should have been killed by celery/verdi by now. If the job is still shown in Mozart in the job-started state, it is stale. Send notifications for these jobs.
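The staleness test described in the TODO above can be sketched as a pure function over epoch timestamps. This is a minimal sketch; the function name and the 5-minute default margin are assumptions.

```python
import time

def job_is_stale(time_started, soft_time_limit, margin=300, now=None):
    """Return True if a job still in job-started state should already
    have been killed: its soft time limit plus a margin (default
    5 minutes) has elapsed since it started. Times are epoch seconds."""
    now = time.time() if now is None else now
    return now > time_started + soft_time_limit + margin

# A job started 2 hours ago with a 1-hour soft time limit is stale:
job_is_stale(time_started=0, soft_time_limit=3600, now=7200)  # -> True
```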

(3) Processing task timeouts

Monitors for celery task-started timeouts.

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] add watchdog script to monitor for task-started timeouts #231

  • TODO: similar to checking for stale job-started timeouts.

(4) Worker is alive

Monitors for worker-heartbeat timeouts.
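The heartbeat check can be sketched as a pure function over the timestamp of the worker's last celery heartbeat event. This is a minimal sketch; the function name and the 10-minute default timeout are assumptions, not values from the actual watchdog.

```python
def worker_is_unresponsive(last_heartbeat, timeout=600, now=None):
    """Return True if a worker's last heartbeat is older than the
    timeout (epoch seconds; the 600 s default is illustrative)."""
    import time
    now = time.time() if now is None else now
    return now - last_heartbeat > timeout

# A worker last heard from 700 s ago, against a 600 s timeout:
worker_is_unresponsive(last_heartbeat=0, timeout=600, now=700)  # -> True
```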

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [open] add watchdog script to monitor for worker-heartbeat timeouts #230

  • TODO: Change to emit slackbot notification instead of tagging resource=worker in ES

(5) Queue churn

Monitors whether active workers exist to drain each queue. Checks that at least one worker is draining each non-empty queue.
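The check can be sketched as a pure function over queue depths and per-queue worker counts. This is a minimal sketch; the function name and queue names are illustrative assumptions.

```python
def churned_queues(queue_depths, workers_per_queue):
    """Return the queues that have waiting jobs but no active workers
    draining them (the "queue churn" alert condition)."""
    return [
        queue
        for queue, depth in queue_depths.items()
        if depth > 0 and workers_per_queue.get(queue, 0) == 0
    ]

# One queue has 5 waiting jobs and no workers; the other is empty:
churned_queues(
    {"job_worker-small": 5, "job_worker-large": 0},
    {"job_worker-large": 2},
)  # -> ["job_worker-small"]
```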

Slackbot notification

  • Queue Status Alert:

    • Alert: No job running though jobs are waiting in the queue!!

Suggested Mitigation Plan for this alert

  • This alert may resolve itself if the auto-scaling group (ASG) scales up workers after the last watchdog check.

Related tickets


