HySDS Cluster Health Page

The HySDS UI provides a minimal Kibana dashboard that displays the current status of the cluster's services, both those managed by supervisord and those managed by systemd.

Kibana is running on https://<MOZART_IP_ADDRESS>/metrics/app/home

Accessing the HySDS Cluster Health page

Fig 1. Link to Kibana in HySDS UI
Fig 2. Sidebar display button and link to the Kibana dashboards

 

 

 

HySDS Cluster health dashboard

The table above [Fig 4] shows the status of systemd services (Elasticsearch, Redis, RabbitMQ, etc.). Check the systemd.SubState and systemd.ActiveStateTimestamp columns for the current status and the time it was last updated.
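As a rough illustration of where these two columns could come from: their names match the `SubState` and `ActiveStateTimestamp` properties reported by `systemctl show`. The sketch below is not the actual `watch_systemd_services.py`; the function names `parse_systemctl_show` and `systemd_status` are hypothetical, and the sample text is made up for the example.

```python
import subprocess


def parse_systemctl_show(output):
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def systemd_status(service):
    """Query one unit (hypothetical helper; needs systemctl on the host)."""
    result = subprocess.run(
        ["systemctl", "show", service,
         "--property=SubState,ActiveStateTimestamp"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)


# Made-up sample output, parsed without touching the real system.
sample = "SubState=running\nActiveStateTimestamp=Mon 2024-01-01 00:00:00 UTC\n"
status = parse_systemctl_show(sample)
```

On a host with systemd, `systemd_status("redis")` would return the same two keys for the live Redis unit.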

 

The table above [Fig 5] shows the status of supervisord services (Celery workers, REST APIs, Logstash, etc.). Check the supervisord.status and supervisord.uptime columns for the current status and uptime.
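For orientation, a status and an uptime per process are also what `supervisorctl status` prints. The sketch below parses that text output into rows resembling the table. It is an assumption that the real watcher works this way (it may use a different interface to supervisord); `parse_supervisorctl_status` and the sample lines are hypothetical.

```python
import re


def parse_supervisorctl_status(output):
    """Parse `supervisorctl status` text into dicts with name/status/uptime."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)  # name, state, free-form description
        if len(parts) < 2:
            continue
        name, status = parts[0], parts[1]
        detail = parts[2] if len(parts) > 2 else ""
        # Only running processes report "uptime ..." in the description.
        match = re.search(r"uptime\s+(\S.*)$", detail)
        rows.append({
            "name": name,
            "status": status,
            "uptime": match.group(1).strip() if match else None,
        })
    return rows


# Made-up sample output for two processes.
sample = (
    "grq2        RUNNING   pid 1234, uptime 1:23:45\n"
    "logstash    STOPPED   Not started\n"
)
rows = parse_supervisorctl_status(sample)
```

Processes that are not running have no uptime, so the sketch reports `None` for them.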

 

Cluster Health backend

Two supervisord-managed programs (watch_supervisord_services and watch_systemd_services) run the checks. They monitor:

  • supervisord services

    • celery workers:

      • job workers (factotum)

      • user rules

      • orchestrator

    • REST APIs (grq2, mozart, pele, etc.)

    • Logstash

    • docker registry

    • sdswatch

    • Kibana

    • Filebeats

    • worker timeouts

  • systemd services

    • Elasticsearch (grq, mozart & metrics)

    • Redis (mozart & metrics)

    • RabbitMQ

    • httpd (proxy)

The scripts check the service statuses once every minute. Their supervisord configuration is:

[program:watch_supervisord_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_supervisord_services.py --host mozart
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10

[program:watch_systemd_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_systemd_services.py --host mozart -s elasticsearch redis rabbitmq-server httpd
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10
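The once-a-minute check can be pictured as a simple polling loop that runs a check and writes the result to stdout, which supervisord then redirects to the log file configured above. This is only a sketch under assumptions: the names `poll`, `check`, and `emit` are hypothetical, and the real scripts' loop and log line format are not shown on this page.

```python
import time


def poll(check, emit=print, interval_s=60, iterations=None, sleep=time.sleep):
    """Call check() every interval_s seconds and pass each result to emit().

    iterations=None loops forever, as a long-running watcher would;
    a finite value and a fake sleep make the loop easy to try out.
    """
    count = 0
    while iterations is None or count < iterations:
        emit(check())
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_s)


# Demo with a stand-in check and a recorded (not real) sleep.
collected, naps = [], []
poll(
    lambda: {"service": "redis", "status": "running"},
    emit=collected.append,
    iterations=3,
    sleep=naps.append,
)
```

Passing `sleep` and `emit` as parameters keeps the loop testable without waiting a real minute between checks.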

 

Future plans for the Cluster Health page

The current page is very bare-bones. Some ideas for the future:

  • Color-coding the status column in the table:

    • RED: service is down

    • GREEN: service is running

    • YELLOW: service is starting

  • Integration with Cloudwatch logs (if possible) or just a simple link

  • Graphs & visualization (if possible)

  • Potentially moving it to a proper frontend/React application with better UI/UX

    • won't be limited by Kibana
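The color-coding idea from the list above could start from a small mapping like the one below. It is a sketch, not an agreed design: `status_color` is a hypothetical name, and which raw status strings count as "starting" (here a few supervisord and systemd state names) is an assumption that would need checking against the values actually stored in the dashboard's index.

```python
# Assumed groupings of raw status strings (lower-cased before lookup).
RUNNING_STATES = {"running"}
STARTING_STATES = {"starting", "start", "start-pre", "start-post"}


def status_color(status):
    """Map a service status string to RED, GREEN, or YELLOW."""
    normalized = status.strip().lower()
    if normalized in RUNNING_STATES:
        return "GREEN"
    if normalized in STARTING_STATES:
        return "YELLOW"
    # Anything else (stopped, failed, dead, unknown) is treated as down.
    return "RED"


colors = [status_color(s) for s in ("RUNNING", "STARTING", "FATAL", "dead")]
```

Defaulting unknown values to RED errs on the side of drawing attention to a service whose state is not recognized.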

 

Note: JPL employees can also get answers to HySDS questions at Stack Overflow Enterprise: