Watchdogs

Hook Hua edited this page on Aug 23, 2018 · 1 revision


Confidence Level TBD  This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Confidence Level TBD  This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Watchdog Alerting/Monitoring Intro

HySDS core functionalities include monitoring and alerting services. For exampole, it can monitor if a queue is inactive (not processing the waiting jobs) or absence of a successful completion of a particular job type within certain period of time and send alert through email or slack channel which allows users to take necessary action if needed.

Queue Monitoring

It monitors if any queue is inactive or a job is hanging there for long time.

Source code

check_queue_periodicity.py

Command

python check_queue_periodicity.py <rabbitmq_account>:<port_no> periodicity_time -s <slack_channel> -e <email_address>

How it works

  • User passes Rabbit MQ account information through command line

  • Using Rabbit MQ API call, get all the queue informations and filter out all the unwanted queues such as any queue starts with ‘celery’ or any queue for which there is no message waiting in the queue

  • For the queues selected, get the following information

  •  

    • How many jobs are waiting in the queue (‘message_ready’)

  •  

    • How many jobs that are presently running (‘unack’)

  • If jobs_waiting>0 and jobs_running=0, send an alert for “possibly queue inactive”

  • If jobs_running>0, query Elasticsearch to find out the information about last job started. If job started longer than periodicity time (passed by user through command line, presently set for 4 hours), send an alert about ‘possible job stacked’

Job Monitoring

It monitors if a particular job type does not have any successful job completion within a ceratin period of time

Source Code

check_job_periodicity_combined.py

Command

python check_job_periodicity_combined.py <job_type> -s <slack_channel> -e <email_address>

Jobs presently monitoring

  • job-sling

  • job-s1_orbit_crawler

  • job-s1_calibration_crawler

  • job-csk_scrape

  • job-csk_incoming

  • job-spyddder-extract (for CSK)

How it works

  • User passes job type (with release information) and time limit through command line

  • Check for “Successfully Completed” jobs: Query Elastic search, for information about the last job successfully completed.

  •  

    • If the last job successfully completed time is within the time range, NO alert is sent. The program exits.

  •  

    • If no information is found and if the job type is NOT ‘csk-incoming’ or ‘csk extract’, add in the error report that “No successfully completed job found” and go to next step “Check for Failed Jobs”.

  •  

    • If the last job “successfully completed” time is beyond the time range and if the job type is NOT ‘csk-incoming’ or ‘csk extract’, record the last successfully completed jobs end time in the error report and go to next step “Check for Failed Jobs”.

  •  

    • If the job-type is ‘csk-incoming’ or ‘csk extract’, just go to next step “Check for Failed Jobs”. (“csk-incoming” and “csk-extract” job does not run unless there is a new csk dataset available, so it is not an error)

  • Check for “Failed” jobs: Query Elastic search, for information about the last job failed.

  •  

    • If no information is found and if the job type is NOT ‘csk-incoming’ or ‘csk extract’, add in the error report that “No failed job found” for that job type.

  •  

    • If the job type is ‘csk-incoming’ or ‘csk extract’ and the last job failed time is BEYOND the time range, no alert is sent and the program exists.

  •  

    • Otherwise, get the information about the last job that failed, add to the error report (This is true for job type ‘csk-incoming’ and ‘csk extract’ too as long as the last job failed time is WITHIN the time range)

  • Send the alert: Create a detailed error report containing all the messages added to it and send the alert to the slack channel.

 


Related Articles:

Related Articles:

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise

Search HySDS Wiki

Page Information:

Page Information:

Was this page useful?

Yes No

Contribution History:

Subject Matter Expert:

@Marjorie Lucas

@Lan Dang

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Marjorie Lucas

Note: JPL employees can also get answers to HySDS questions at Stack Overflow Enterprise: