Hook Hua edited this page on Aug 28, 2018 · 7 revisions


Confidence Level: TBD. This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

See Watchdogs for implementation notes.

Separate from Mozart, HySDS employs cron scripts that check Mozart periodically and send slackbot notifications to the team when specific conditions are detected. Watchdog slackbot notifications are sent to a Slack team channel for the following conditions:
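The notification side of such a cron script can be sketched as a small helper that posts to a Slack incoming webhook. This is a minimal sketch under assumptions: the webhook URL, function names, and message format here are illustrative, not the actual watchdog implementation.

```python
import json
import urllib.request

# Hypothetical webhook URL -- a real deployment would configure its own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_alert(condition, detail):
    """Build the Slack message payload for a watchdog condition."""
    return {"text": f":warning: Watchdog alert ({condition}): {detail}"}

def send_alert(condition, detail):
    """POST the alert payload to the Slack incoming webhook."""
    payload = json.dumps(build_alert(condition, detail)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

A cron entry would then invoke the script on a fixed interval, calling send_alert() only when a condition below is detected.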

(1) Processing is alive

Monitors for job types that have not successfully completed within a timeout period.

Slackbot notification

  • Job status checking for job type

    • The last successfully completed job of this type was N hours ago.

    • The last failed job of this type was N hours ago.
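The decision logic for this check can be sketched as a pure helper that, given the timestamp of the last successful job of a type, decides whether an alert should fire. This is a minimal sketch; the function name and timeout parameter are assumptions, not the actual watchdog code.

```python
from datetime import datetime, timedelta, timezone

def job_type_is_stale(last_success, timeout_hours, now=None):
    """Return True if the last successful job of a type completed
    longer than timeout_hours ago (i.e. an alert should fire)."""
    now = now or datetime.now(timezone.utc)
    return now - last_success > timedelta(hours=timeout_hours)

# Example: last success 25 hours ago against a 24-hour timeout.
now = datetime(2018, 8, 28, 12, 0, tzinfo=timezone.utc)
job_type_is_stale(now - timedelta(hours=25), 24, now=now)  # -> True
```

The watchdog would query Mozart for the most recent job-completed timestamp per job type and feed it to a check like this one.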

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] update watchdog script slackbot to show last few failed jobs instead of last successful job #498

  • [closed] add watchdog script to monitor for mozart jobs that have not successfully completed in timeout period #491

(2) Processing job timeouts

Monitors for job-started and job-offline timeouts.

Slackbot notification

  • Queue Status Alert:

    • Alert: Possible Job hanging in the queue

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] add watchdog script to monitor for job-started and job-offline timeouts #232, which tagged resource=job with tag=timedout

  • TODO: change to a variant that emits slackbot notifications.

  • TODO: check for stale job-started and job-offline jobs, and send slackbot notifications. For each job in the job-started or job-offline state:

    • Look up that job's soft time limit.

    • Check whether now > (time started + soft time limit + margin), i.e. whether the soft time limit plus some margin (e.g. 5 minutes) has elapsed.

    • If it has, the job should have been killed by celery/verdi by now. If the job is still shown in Mozart in the job-started state, it is stale. Send notifications for these jobs.
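The staleness test described in the TODO above can be sketched as a pure function over epoch timestamps. This is a minimal sketch; the function name and the 5-minute default margin are assumptions.

```python
import time

def job_is_stale(time_started, soft_time_limit, margin=300, now=None):
    """Return True if a job still in job-started state should already
    have been killed: its soft time limit plus a margin (default
    5 minutes) has elapsed since it started. Times are epoch seconds."""
    now = time.time() if now is None else now
    return now > time_started + soft_time_limit + margin

# A job started 2 hours ago with a 1-hour soft time limit is stale:
job_is_stale(time_started=0, soft_time_limit=3600, now=7200)  # -> True
```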

(3) Processing task timeouts

Monitors for celery task-started timeouts.

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [closed] add watchdog script to monitor for task-started timeouts #231

  • TODO: similar to checking for stale job-started timeouts.

(4) Worker is alive

Monitors for worker-heartbeat timeouts.
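The heartbeat check can be sketched as a pure function over the timestamp of the worker's last celery heartbeat event. This is a minimal sketch; the function name and the 10-minute default timeout are assumptions, not values from the actual watchdog.

```python
def worker_is_unresponsive(last_heartbeat, timeout=600, now=None):
    """Return True if a worker's last heartbeat is older than the
    timeout (epoch seconds; the 600 s default is illustrative)."""
    import time
    now = time.time() if now is None else now
    return now - last_heartbeat > timeout

# A worker last heard from 700 s ago, against a 600 s timeout:
worker_is_unresponsive(last_heartbeat=0, timeout=600, now=700)  # -> True
```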

Slackbot notification

  • (to be filled out)

Suggested Mitigation Plan for this alert

  • (to be filled out)

Related tickets

  • [open] add watchdog script to monitor for worker-heartbeat timeouts #230

  • TODO: Change to emit slackbot notification instead of tagging resource=worker in ES

(5) Queue churn

Monitors whether active workers exist to drain each queue. Checks that at least one worker is draining each non-empty queue.
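The check can be sketched as a pure function over queue depths and per-queue worker counts. This is a minimal sketch; the function name and queue names are illustrative assumptions.

```python
def churned_queues(queue_depths, workers_per_queue):
    """Return the queues that have waiting jobs but no active workers
    draining them (the "queue churn" alert condition)."""
    return [
        queue
        for queue, depth in queue_depths.items()
        if depth > 0 and workers_per_queue.get(queue, 0) == 0
    ]

# One queue has 5 waiting jobs and no workers; the other is empty:
churned_queues(
    {"job_worker-small": 5, "job_worker-large": 0},
    {"job_worker-large": 2},
)  # -> ["job_worker-small"]
```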

Slackbot notification

  • Queue Status Alert:

    • Alert: No job running though jobs are waiting in the queue!!

Suggested Mitigation Plan for this alert

  • This alert may resolve itself if the auto-scaling group (ASG) scales up workers after the last watchdog check.

Related tickets


