Mozart Watchdog Notifications
Hook Hua edited this page on Aug 28, 2018 · 7 revisions
Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.
See Watchdogs for implementation notes.
Separate from Mozart itself, HySDS runs cron scripts that check Mozart periodically and send slackbot notifications to the team when specific conditions are detected. Watchdog slackbot notifications are sent to a Slack team channel for the following conditions:
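These cron-driven checks report to Slack. A minimal sketch of posting a notification via a Slack incoming webhook; the webhook URL, channel name, and bot username below are placeholders for illustration, not values from this deployment:

```python
import json
from urllib.request import Request, urlopen


def build_slack_payload(channel, text, username="mozart-watchdog"):
    """Build the JSON body for a Slack incoming-webhook POST."""
    return {"channel": channel, "username": username, "text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (network call)."""
    req = Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # The channel and message are examples only.
    payload = build_slack_payload("#hysds-alerts", "Queue Status Alert: ...")
    print(json.dumps(payload))
```

Each watchdog below would format its finding as the `text` field and call `post_to_slack` with the team's configured webhook URL.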
(1) Processing is alive
Monitors for job types that have not successfully completed within a timeout period.
Slackbot notification
Job Status checking for job type
The last successfully completed job of type was N-hours ago.
The last failed job of type was N-hours ago.
Example: NoClobberException: Destination, s3://s3-us-west-2.amazonaws.com:80/....dataset.json, already exists and no-clobber is set.
Example: SoftTimeLimitExceeded: SoftTimeLimitExceeded()
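This check amounts to finding the most recent job of each type and measuring how long ago it completed. A minimal sketch against Mozart's Elasticsearch job index; the field names (`type`, `status`, `@timestamp`) are assumptions about the index mapping and should be verified against your deployment:

```python
from datetime import datetime, timezone


def last_job_query(job_type, status="job-completed", size=1):
    """Elasticsearch query body for the most recent job of a given
    type and status, sorted newest-first. Field names are assumed."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"type": job_type}},
                    {"term": {"status": status}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }


def hours_since(timestamp, now=None):
    """Hours elapsed since an ISO-8601 UTC timestamp like
    '2018-08-28T06:00:00Z'."""
    now = now or datetime.now(timezone.utc)
    then = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return (now - then).total_seconds() / 3600.0


def is_overdue(timestamp, timeout_hours, now=None):
    """True if the last completed job is older than the timeout period,
    i.e. the 'last successful job of type was N hours ago' alert fires."""
    return hours_since(timestamp, now) > timeout_hours
```

The same query with `status="job-failed"` covers the "last failed job of type" notification.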
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] update watchdog script slackbot to show last few failed jobs instead of last successful job #498
[closed] add watchdog script to monitor for mozart jobs that have not successfully completed in timeout period #491
(2) Processing job timeouts
Monitors for job-started and job-offline timeouts.
Slackbot notification
Queue Status Alert:
Alert: Possible Job hanging in the queue
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] add watchdog script to monitor for job-started and job-offline timeouts #232, which tagged resource=job with tag=timedout
TODO: Need to change to variant that emits slackbot notifications.
TODO: check for stale job-started and job-offline jobs, and send a slackbot notification. For each job in the job-started or job-offline state:
- Look up that job's soft time limit.
- Check whether now > (time started + soft time limit + margin), i.e. whether the time limit plus some margin (e.g. 5 minutes) has elapsed.
- If it has, the job should already have been killed by celery/verdi. If it was not, the job as shown in Mozart is stale and stuck in the job-started state. Send notifications for these jobs.
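The staleness test described in the TODO can be sketched as a pure function. The job field names (`id`, `time_start`, `soft_time_limit`) are assumptions for illustration; fetching the candidate jobs from Mozart is left to the caller:

```python
from datetime import datetime, timedelta, timezone


def is_stale(time_started, soft_time_limit_secs, margin_secs=300, now=None):
    """A job in job-started/job-offline state is stale when
    now > time_started + soft_time_limit + margin: celery/verdi
    should have killed it by then."""
    now = now or datetime.now(timezone.utc)
    started = datetime.fromisoformat(time_started.replace("Z", "+00:00"))
    deadline = started + timedelta(seconds=soft_time_limit_secs + margin_secs)
    return now > deadline


def find_stale_jobs(jobs, now=None):
    """Filter jobs (dicts with assumed 'id', 'time_start', and
    'soft_time_limit' fields) down to the stale ones to notify about."""
    return [
        j for j in jobs
        if is_stale(j["time_start"], j["soft_time_limit"], now=now)
    ]
```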
(3) Processing task timeouts
Monitors for celery task-started timeouts.
Slackbot notification
(to be filled out)
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[closed] add watchdog script to monitor for task-started timeouts #231
TODO: Similar to the check for stale job-started timeouts in (2).
(4) Worker is alive
Monitors for worker-heartbeat timeouts.
Slackbot notification
(to be filled out)
Suggested Mitigation Plan for this alert
(to be filled out)
Related tickets
[open] add watchdog script to monitor for worker-heartbeat timeouts #230
TODO: Change to emit a slackbot notification instead of tagging resource=worker in ES.
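A heartbeat-timeout check reduces to comparing each worker's last heartbeat timestamp against a threshold. A minimal sketch, assuming the caller has already collected last-heartbeat times per worker (e.g. from celery's event stream); the worker names and timestamp format are illustrative:

```python
from datetime import datetime, timezone


def timed_out_workers(last_heartbeats, timeout_secs, now=None):
    """Given a mapping of worker name -> last heartbeat time
    (ISO-8601 UTC string), return the workers whose heartbeat is
    older than timeout_secs, sorted for stable notification text."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for worker, ts in last_heartbeats.items():
        then = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if (now - then).total_seconds() > timeout_secs:
            stale.append(worker)
    return sorted(stale)
```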
(5) Queue churn
Monitors whether active workers are draining each queue, i.e. checks that workers exist to drain each non-empty queue.
Slackbot notification
Queue Status Alert:
Alert: No job running though jobs are waiting in the queue!!
Suggested Mitigation Plan for this alert
This condition may resolve itself if autoscaling (ASG) kicks in after the watchdog last checked.
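The queue-churn check itself is a simple filter over per-queue statistics. A minimal sketch, assuming queue stats shaped like the RabbitMQ management API's `/api/queues` response (`name`, `messages`, `consumers` fields); fetching those stats is left to the caller:

```python
def churned_queues(queues):
    """Return the names of queues that hold messages but have no
    consumers draining them: the 'jobs waiting but no job running'
    condition this watchdog alerts on."""
    return [
        q["name"] for q in queues
        if q.get("messages", 0) > 0 and q.get("consumers", 0) == 0
    ]
```

Any non-empty result would be formatted into the Queue Status Alert message above.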
Related tickets
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua