The HySDS UI provides a minimal Kibana dashboard that displays the current status of its services (supervisord / systemd).

Kibana is available at https://<MOZART_IP_ADDRESS>/metrics/app/home

Accessing the HySDS Cluster Health page

...

Image Added

HySDS Cluster health dashboard

...

The table above [Fig 4] shows the status of systemd services (Elasticsearch, Redis, RabbitMQ, etc.). Check the systemd.SubState and systemd.ActiveStateTimestamp columns for the current status and the time it was last updated.

Image Added

The table above [Fig 5] shows the status of supervisord services (celery workers, REST APIs, Logstash, etc.). Check the supervisord.status and supervisord.uptime columns for the current status and uptime.
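For reference, a status and an uptime are also what `supervisorctl status` prints for each process. The sketch below is an illustration only, not the actual HySDS collector: it parses one line of that output into the two fields. The function name and the dictionary keys are hypothetical.

```python
import re

# Matches the "uptime ..." part of a supervisorctl status description,
# e.g. "pid 4242, uptime 1:02:03" or "pid 4242, uptime 2 days, 1:02:03".
UPTIME_RE = re.compile(r"uptime (.+)$")


def parse_status_line(line):
    """Parse one line of `supervisorctl status` output.

    Returns a dict with hypothetical keys name/status/uptime,
    or None if the line is blank or malformed.
    """
    parts = line.split(None, 2)  # name, state, free-form description
    if len(parts) < 2:
        return None
    name, state = parts[0], parts[1]
    description = parts[2].strip() if len(parts) > 2 else ""
    match = UPTIME_RE.search(description)
    return {
        "name": name,
        "status": state,
        # Processes that are not running report no uptime.
        "uptime": match.group(1) if match else None,
    }
```

A process that is not running has no uptime in its description, so the sketch returns `None` for that field.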

Cluster Health backend

Two supervisord processes are running; they check the following:

supervisord services
- celery workers:
  - job workers (factotum)
  - user rules
  - orchestrator
- REST APIs (grq2, mozart, pele, etc.)
- Logstash
- Docker registry
- sdswatch
- Kibana
- Filebeat
- worker timeouts

systemd services
- Elasticsearch (grq, mozart & metrics)
- Redis (mozart & metrics)
- RabbitMQ
- httpd (proxy)

The scripts check service statuses once every minute.

[program:watch_supervisord_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_supervisord_services.py --host mozart
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10

[program:watch_systemd_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_systemd_services.py --host mozart -s elasticsearch redis rabbitmq-server httpd
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10
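
To illustrate what a watcher given a service list such as `-s elasticsearch redis rabbitmq-server httpd` might do, here is a minimal sketch that polls systemd once a minute. It is not the code of `watch_systemd_services.py`; the function names, the `cycles` parameter, and the printed output are assumptions. It relies only on `systemctl show`, which prints `KEY=VALUE` lines for the requested properties.

```python
import subprocess
import time

# Properties behind the dashboard columns (SubState, ActiveStateTimestamp).
PROPERTIES = ("ActiveState", "SubState", "ActiveStateTimestamp")


def parse_systemctl_show(output):
    """Parse KEY=VALUE lines from `systemctl show` into a dict."""
    status = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            status[key.strip()] = value.strip()
    return status


def check_service(name):
    """Query systemd for one unit (requires systemctl on the host)."""
    result = subprocess.run(
        ["systemctl", "show", name, "--property=" + ",".join(PROPERTIES)],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_systemctl_show(result.stdout)


def watch(services, interval=60, cycles=None):
    """Print each service's status every `interval` seconds.

    `cycles` is a hypothetical knob to stop after N rounds; None loops forever.
    """
    done = 0
    while cycles is None or done < cycles:
        for name in services:
            print(name, check_service(name))
        time.sleep(interval)
        done += 1
```

Calling `watch(["elasticsearch", "redis", "rabbitmq-server", "httpd"])` would mirror the service list in the configuration above.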

Future plans for the Cluster Health page

The current page is very bare-bones, but some ideas for the future are:

- Color-coding the status column in the table:
  - RED: service is down
  - GREEN: service is running
  - YELLOW: service is starting
- Integration with CloudWatch logs (if possible) or just a simple link
- Graphs & visualization
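
As a rough illustration of the color-coding idea, the mapping could be as simple as the sketch below. The function name and the grouping of states are assumptions: it treats the supervisord `RUNNING` state and the systemd `running` sub-state as GREEN, a few "starting" states as YELLOW, and anything else as RED.

```python
# Hypothetical grouping of status strings; compared case-insensitively.
RUNNING_STATES = {"running"}
STARTING_STATES = {"starting", "start", "start-pre", "start-post"}


def status_color(status):
    """Map a supervisord status or systemd SubState to a display color."""
    normalized = status.strip().lower()
    if normalized in RUNNING_STATES:
        return "GREEN"
    if normalized in STARTING_STATES:
        return "YELLOW"
    # Assumption: every other state (stopped, fatal, dead, failed, ...) is "down".
    return "RED"
```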
