...

HySDS provides a minimal Kibana dashboard that displays the current status of its services (supervisord and systemd).

Kibana is available at https://<MOZART_IP_ADDRESS>/metrics/app/home

Accessing the HySDS Cluster Health page

...


HySDS Cluster health dashboard

...

The table above [Fig 4] shows the status of systemd services (Elasticsearch, Redis, RabbitMQ, etc.). Check the systemd.SubState and systemd.ActiveStateTimestamp columns for the current status and the time it was last updated.
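To make the two columns concrete, here is a minimal sketch of how the same fields can be read directly from systemd with `systemctl show`. It is not the code behind the dashboard: the function names are hypothetical, and the `systemd.` prefix on the column names is assumed to be added somewhere in the logging pipeline.

```python
import subprocess


def parse_systemctl_show(output: str) -> dict:
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def systemd_status(service: str) -> dict:
    """Return SubState and ActiveStateTimestamp for one unit (hypothetical helper)."""
    result = subprocess.run(
        ["systemctl", "show", service,
         "--property=SubState,ActiveStateTimestamp"],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout)


# Illustrative output only; the timestamp is made up for the example.
sample = "SubState=running\nActiveStateTimestamp=Mon 2021-03-01 10:15:00 UTC\n"
parsed = parse_systemctl_show(sample)
```

On a host with systemd, `systemd_status("redis")` would return a dict with the same two keys.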

[Fig 5: supervisord services table]

The table above [Fig 5] shows the status of supervisord services (Celery workers, REST APIs, Logstash, etc.). Check the supervisord.status and supervisord.uptime columns for the current status and uptime.
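The status and uptime values correspond to what `supervisorctl status` prints for each process. The sketch below parses that text output into rows. It is an illustration under assumptions, not the dashboard's actual implementation: the function name and the sample process lines are hypothetical.

```python
import re

# One line of `supervisorctl status`: NAME, STATE, then free-form detail.
_STATUS_LINE = re.compile(r"^(?P<name>\S+)\s+(?P<status>[A-Z]+)\s*(?P<detail>.*)$")
# For running processes the detail ends with "uptime H:MM:SS" (optionally "N day(s), ").
_UPTIME = re.compile(r"uptime\s+(?P<uptime>.+)$")


def parse_supervisorctl_status(output: str) -> list:
    """Turn `supervisorctl status` text into a list of dicts."""
    rows = []
    for line in output.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if not match:
            continue  # ignore blank or unrecognized lines
        uptime = _UPTIME.search(match.group("detail"))
        rows.append({
            "name": match.group("name"),
            "status": match.group("status"),
            "uptime": uptime.group("uptime") if uptime else None,
        })
    return rows


# Illustrative sample; the process names, pid, and uptime are made up.
sample = (
    "grq2                             RUNNING   pid 1234, uptime 1 day, 2:03:04\n"
    "logstash                         STOPPED   Not started\n"
)
rows = parse_supervisorctl_status(sample)
```

Processes that are not running have no uptime in the output, so the sketch reports `None` for them.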

Cluster Health backend

Two processes run under supervisord; they check the status of:

  • supervisord services

    • Celery workers:

      • job workers (factotum)

      • user rules

      • orchestrator

    • REST APIs (grq2, mozart, pele, etc.)

    • Logstash

    • Docker registry

    • sdswatch

    • Kibana

    • Filebeat

    • worker timeouts

  • systemd services

    • Elasticsearch (grq, mozart & metrics)

    • Redis (mozart & metrics)

    • RabbitMQ

    • httpd (proxy)

The scripts check service statuses once every minute. They are configured in supervisord as follows:
[program:watch_supervisord_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_supervisord_services.py --host mozart
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10

[program:watch_systemd_services]
directory={{ OPS_HOME }}
command={{ OPS_HOME }}/mozart/bin/watch_systemd_services.py --host mozart -s elasticsearch redis rabbitmq-server httpd
process_name=%(program_name)s
priority=999
numprocs=1
numprocs_start=0
redirect_stderr=true
startretries=0
stdout_logfile=%(here)s/../log/%(program_name)s.fulldict.sdswatch.log
stdout_logfile_maxbytes=100MB
stdout_logfile_backups=10
startsecs=10
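
The source of `watch_supervisord_services.py` and `watch_systemd_services.py` is not shown here, so the following is only a rough sketch of the once-a-minute polling pattern described above. Every name in it (`watch`, `check`, the record fields) is hypothetical, and the JSON record layout is an assumption rather than the actual SDSWatch log format. Because the configuration sets `stdout_logfile`, anything such a script prints to stdout ends up in the corresponding `.sdswatch.log` file.

```python
import json
import time
from datetime import datetime, timezone


def watch(check, services, interval=60.0, iterations=None,
          sleep=time.sleep, emit=print):
    """Call check(service) for every service, emit one JSON line per result,
    then wait `interval` seconds. Runs forever when iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        for service in services:
            record = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "service": service,
            }
            record.update(check(service))  # e.g. {"status": "running"}
            emit(json.dumps(record))
        done += 1
        if iterations is None or done < iterations:
            sleep(interval)  # one-minute default matches the interval above


# Dry run with a fake checker and no real sleeping.
lines, sleeps = [], []
watch(lambda svc: {"status": "running"}, ["redis", "httpd"],
      iterations=2, sleep=sleeps.append, emit=lines.append)
```

Passing `sleep` and `emit` as parameters is a design choice for the sketch: it lets the loop be exercised without waiting a full minute or writing to a real log.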

Future plans for the Cluster Health page

The current page is bare-bones, but some ideas for the future are:

  • Color-coding the status column in the table:

    • RED: service is down

    • GREEN: service is running

    • YELLOW: service is starting

  • Integration with CloudWatch Logs (if possible), or just a simple link to the logs

  • Graphs & visualization
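
The color-coding idea in the list above could be expressed as a small mapping from a status string to a color. This is only a sketch of the proposal: the function name is hypothetical, and which status strings count as "starting" is an assumption based on common supervisord and systemd state names.

```python
# Assumed status strings; adjust to whatever the dashboard actually reports.
RUNNING_STATES = {"running"}
STARTING_STATES = {"starting", "start", "auto-restart", "backoff"}


def status_color(status: str) -> str:
    """Map a service status to the proposed RED / GREEN / YELLOW scheme."""
    normalized = status.strip().lower()
    if normalized in RUNNING_STATES:
        return "GREEN"
    if normalized in STARTING_STATES:
        return "YELLOW"
    # Anything else (stopped, dead, failed, unknown) is treated as down.
    return "RED"


colors = [status_color(s) for s in ("RUNNING", "STARTING", "dead")]
```

Treating every unrecognized status as RED errs on the side of flagging a problem rather than hiding one.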