HySDS Cluster Health Page

The HySDS UI includes a minimal Kibana dashboard that displays the current status of its services (supervisord and systemd).

Kibana is running on https://<MOZART_IP_ADDRESS>/metrics/app/home

Accessing the HySDS Cluster Health page

Fig 1. Link to Kibana in HySDS UI
Fig 2. Sidebar display button and link to the Kibana dashboards

HySDS Cluster Health dashboard

The table above [Fig 4] shows the status of systemd services (Elasticsearch, Redis, RabbitMQ, etc.). Check the systemd.SubState and systemd.ActiveStateTimestamp columns for the current status and the time it was last updated.
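SubState and ActiveStateTimestamp are standard systemd unit properties, so the same values can be inspected from a shell with `systemctl show`. The following is a minimal sketch of reading them in Python; it is not the code of watch_systemd_services.py, the function names are hypothetical, and the sample text is illustrative rather than captured from a real cluster.

```python
import subprocess


def parse_systemctl_show(output: str) -> dict:
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore any line that has no '='
            props[key.strip()] = value.strip()
    return props


def systemd_status(unit: str) -> dict:
    """Query one unit (hypothetical helper; needs a host running systemd)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=SubState,ActiveStateTimestamp"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_systemctl_show(result.stdout)


# Illustrative sample only, not real cluster output.
sample = "SubState=running\nActiveStateTimestamp=Mon 2024-01-01 00:00:00 UTC\n"
print(parse_systemctl_show(sample)["SubState"])  # running
```

On a host with systemd, `systemd_status("redis")` would return a dict holding the two properties for that unit.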

 

The table above [Fig 5] shows the status of supervisord services (celery workers, REST APIs, Logstash, etc.). Check the supervisord.status and supervisord.uptime columns for the current status and uptime.
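The same status and uptime information is available on the command line from `supervisorctl status`. As a rough sketch (not the actual watch_supervisord_services.py logic), the text output could be turned into status/uptime records like this; the function name is hypothetical and the sample lines are illustrative.

```python
import re

# Matches the "uptime ..." part of a RUNNING line, e.g. "pid 1234, uptime 0:05:12".
_UPTIME = re.compile(r"uptime\s+(?P<uptime>.+)$")


def parse_supervisorctl_status(output: str) -> list:
    """Turn `supervisorctl status` text into a list of dicts."""
    rows = []
    for line in output.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        detail = parts[2] if len(parts) > 2 else ""
        match = _UPTIME.search(detail)
        rows.append({
            "name": parts[0],
            "status": parts[1],
            # Only running processes report an uptime.
            "uptime": match.group("uptime") if match else None,
        })
    return rows


# Illustrative sample only; the program names are placeholders.
sample = (
    "orchestrator                     RUNNING   pid 1234, uptime 0:05:12\n"
    "logstash                         STOPPED   Not started\n"
)
for row in parse_supervisorctl_status(sample):
    print(row["name"], row["status"], row["uptime"])
```

Processes that are not running have no uptime, so the sketch records `None` for them.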

 

Cluster Health backend

Two watcher processes run under supervisord; they check the status of:

  • supervisord services

    • celery workers:

      • job workers (factotum)

      • user rules

      • orchestrator

    • REST APIs (grq2, mozart, pele, etc.)

    • Logstash

    • docker registry

    • sdswatch

    • Kibana

    • Filebeat

    • worker timeouts

  • systemd services

    • Elasticsearch (grq, mozart & metrics)

    • Redis (mozart & metrics)

    • RabbitMQ

    • httpd (proxy)

The scripts check the service statuses once every minute. They are configured in supervisord as follows:

[program:watch_supervisord_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_supervisord_services.py --host mozart
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10

[program:watch_systemd_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_systemd_services.py --host mozart -s elasticsearch redis rabbitmq-server httpd
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10
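Both program entries redirect the watcher's stdout to a log file, so each watcher only needs to print one record per service on every pass. A hedged sketch of such a once-a-minute loop is shown below. The `watch` function, its parameters, and the JSON record layout are all assumptions made for illustration; the real scripts and the sdswatch log format may differ.

```python
import json
import time
from typing import Callable, Dict, Optional


def watch(
    check: Callable[[], Dict[str, str]],
    interval: float = 60.0,
    iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `check` repeatedly and print one record per service to stdout.

    `iterations=None` loops forever, as a long-running watcher would;
    a finite value makes the loop easy to try out or test.
    """
    count = 0
    while iterations is None or count < iterations:
        for service, status in check().items():
            # Hypothetical record layout, not the real sdswatch format.
            print(json.dumps({"service": service, "status": status}), flush=True)
        count += 1
        if iterations is None or count < iterations:
            sleep(interval)  # wait before the next pass


# One pass with a stubbed check function (no sleeping involved).
watch(lambda: {"redis": "running"}, iterations=1)
# prints {"service": "redis", "status": "running"}
```

Passing the sleep function as a parameter keeps the interval configurable and lets the loop be exercised without waiting a full minute.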

 

Future plans for the Cluster Health page

The current page is bare-bones, but some ideas for the future are:

  • Color-coding the status column in the table:

    • RED: service is down

    • GREEN: service is running

    • YELLOW: service is starting

  • Integration with CloudWatch logs (if possible), or just a simple link to them

  • Graphs & visualization (if possible)

  • Potentially moving it to a proper frontend/React application with better UI/UX

    • won't be limited by Kibana
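The color-coding idea above could be prototyped with a small mapping function before any UI work. This is only a sketch: `status_color` is a hypothetical name, and the set of "starting" values is an assumption that mixes supervisord's STARTING state with systemd start sub-states. Anything unrecognized is treated as down.

```python
# Assumed "starting" values (supervisord STARTING, systemd start sub-states).
STARTING_STATES = {"starting", "start", "start-pre", "start-post"}


def status_color(status: str) -> str:
    """Map a service status string to RED, GREEN, or YELLOW."""
    normalized = status.strip().lower()
    if normalized == "running":
        return "GREEN"
    if normalized in STARTING_STATES:
        return "YELLOW"
    return "RED"  # everything else is treated as down in this sketch


for status in ("RUNNING", "STARTING", "dead"):
    print(status, status_color(status))
```

Treating every unknown status as RED is a deliberately conservative choice; a real implementation might want a separate color for states such as a cleanly exited one-shot service.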

 
