SDSWatch Metrics

Confidence Level High: This article has been formally reviewed and is signed off on by a relevant subject matter expert.

Intro

SDSWatch is a logging mechanism for collecting metrics and operational insights from SDS components.


Use cases

Metrics collection inside PGE processing steps

The basic use case is for arbitrary PGEs to output key log metrics that SDSWatch scoops up for analytics. By agreeing on a basic schema for CSV-style log files output by PGEs, any component can emit metrics that are streamed back to the cloud for analytics, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs each write key/value CSV files on their own workers, and SDSWatch streams the metrics back to the ELK stack for analysis.

This approach enables even legacy Fortran algorithms to emit metrics that can be scooped up by SDSWatch and streamed to the aggregation point for analytics.

In addition to the key/value pair itself, an example SDSWatch log line may also include a timestamp, host, and source:

  • 2020-03-03 10:26:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “coregistration”

  • 2020-03-03 10:40:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “phase unwrapping”

Example job to instrument with SDSWatch: job-acquisition_ingest-scihub:release-20190710
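
For illustration, a PGE step could append lines like the ones above using nothing more than standard file I/O. The sketch below is hypothetical (the file name, helper function, and source id are illustrative, not part of any SDSWatch API):

    # Hypothetical sketch: a PGE appends SDSWatch CSV lines
    # (timestamp, host, source id, key, value) with plain file I/O.
    import csv
    import socket
    from datetime import datetime, timezone

    SDSWATCH_LOG = "topsapp.pge.sdswatch.log"  # illustrative file name

    def log_metric(source_id, key, value):
        """Append one SDSWatch key/value record to the local log file."""
        with open(SDSWATCH_LOG, "a", newline="") as f:
            csv.writer(f).writerow([
                datetime.now(timezone.utc).isoformat(),
                socket.gethostbyname(socket.gethostname()),  # worker IP
                source_id,
                key,
                value,
            ])

    # e.g. inside the topsApp workflow:
    log_metric("hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121",
               "step", "coregistration")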

Metrics collection from system crons

For some adaptations of hysds-core, periodic submission of jobs is done by Lambda functions and crons. Crons usually leave behind log files that are invisible to Metrics. By simply allowing these existing cron scripts to emit SDSWatch CSV logs, SDSWatch can monitor the files for line updates and ship them to Elasticsearch for analysis.

An important use case here is the ability to capture cron script warnings or failures. Without SDSWatch, such errors would not be known unless someone logged into the system to look for them.
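
As an illustration only (the log path and the submit_periodic_jobs task below are hypothetical), a cron-driven script could record its own outcome as SDSWatch CSV lines so that failures surface in Metrics:

    # Hypothetical sketch: a cron-driven script records its outcome as
    # full-schema SDSWatch CSV lines (timestamp, host, source type,
    # source id, key, value) so failures become visible in Metrics.
    import csv
    import socket
    import traceback
    from datetime import datetime, timezone

    LOG = "/home/ops/mozart/log/sdswatch/cron_submit.fullkv.sdswatch.log"  # illustrative

    def record(key, value):
        with open(LOG, "a", newline="") as f:
            csv.writer(f).writerow([
                datetime.now(timezone.utc).isoformat(),
                socket.gethostname(), "cron", "cron_submit", key, value,
            ])

    def submit_periodic_jobs():
        """Placeholder for the real periodic job-submission logic."""
        ...

    try:
        submit_periodic_jobs()
        record("status", "success")
    except Exception:
        record("status", "failure")
        record("error", traceback.format_exc().splitlines()[-1])
        raise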

Metrics collection in critical HySDS components

Similar to the previous use case, but essential for monitoring HySDS core components. For example:

Metrics collection on internal states of PGEs, e.g. short-circuiting existing datasets in topsApp

job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the beginning of the topsApp PGE, it checks whether the expected output already exists; if so, it exits immediately. In this use case, the PGE would emit an SDSWatch metric reporting dataset existence and the short-circuit.
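
As an illustrative example only (the key name is hypothetical), the emitted key/value line could look like:

    2020-03-03 10:26:05, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “dataset_exists”, “true”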

Metrics analysis via Kibana

Enable the visualization of key/value metrics, first in aggregate across all workers and components. This enables viewing statistics of key/values such as the min/mean/max values of a reported metric. Then facet into one compute node or worker to see metrics just for that worker. Then facet onto one single metric to see its value reported over time.

Dashboard panels to support faceting:

  • keys over time

  • values over time

  • table distribution of keys

  • table distribution of values

  • table distribution of IP addresses

  • table distribution of component IDs

Metrics collection for Verdi events, e.g. harikiri

The Verdi job worker has many states that can be reported to SDSWatch for analysis in real time, e.g. harikiri, job states, etc. Verdi could write SDSWatch logs to enable insight into job worker events.


Design

SDSWatch client (on compute node)

The SDSWatch client monitors SDSWatch log files and ships new entries to the server.

There are three SDSWatch log types that the client-side Logstash handles:

1. Key-Value SDSWatch log type

  • Full-schema logs for system developers working on HySDS core components (e.g., Verdi, Mozart, GRQ, etc).

  • Log schema: <timestamp ISO 8601>, <host>, <source type>, <source id>, <metric key>, <metric value>

  • File location and format: /home/ops/mozart/log/sdswatch/*.fullkv.sdswatch.log

    • Generic logs are typically in the log directory managed by Supervisord

  • Example log:

    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, state, running
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, unacked, 0
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, state, running
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, ready, 51
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, unacked, 131

2. Minimal Key-Value SDSWatch log type

  • Simplified logs for PGE developers

  • Log schema: <timestamp ISO 8601>, <key>, <value>

  • File location and format: /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log

    • PGE logs are typically in the data directory of Verdi job worker.

  • Example log:

    '2020-05-25 01:52:40.569', key1, value1
    '2020-05-25 01:52:40.570', auxiliary_key, value3
    '2020-05-25 01:52:40.570', key2, value2

3. Full Dictionary SDSWatch log type

  • Most powerful log format for aggregating key-value pairs into single log lines

  • Log schema: <timestamp ISO 8601>, <key1>=<value1>, <key2>=<value2>, ..., <keyN>=<valueN>

  • File location and format: /home/ops/mozart/log/sdswatch/*.fulldict.sdswatch.log

  • Example log:

    '2020-05-25 01:52:40.569', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=elasticsearch activestate=running activestatetimestamp=<TS>
    '2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=rabbitmq-server activestate=running activestatetimestamp=<TS>
    '2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=redis activestate=running activestatetimestamp=<TS>
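
As a rough sketch of producing this format (the file name, helper function, and service values below are illustrative and mirror the example above; they are not an official SDSWatch API):

    # Hypothetical sketch: write one Full Dictionary SDSWatch line that
    # aggregates several key=value pairs, mirroring the example above.
    import socket
    from datetime import datetime, timezone

    FULLDICT_LOG = "/home/ops/mozart/log/sdswatch/services.fulldict.sdswatch.log"  # illustrative

    def log_dict(source_type, source_id, **pairs):
        """Append one line with all key=value pairs aggregated together."""
        ts = datetime.now(timezone.utc).isoformat()
        host = socket.gethostname()
        kv = " ".join(f"{k}={v}" for k, v in pairs.items())
        with open(FULLDICT_LOG, "a") as f:
            f.write(f"{ts}, {host}, {source_type}, {source_id}, {kv}\n")

    log_dict("systemd", "services-monitor",
             service="elasticsearch",
             activestate="running",
             activestatetimestamp=datetime.now(timezone.utc).isoformat())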

SDSWatch server

The SDSWatch server collects metrics and provides analytics over them.

Redis is leveraged as the broker transport for delivery to Elasticsearch. Writing to Elasticsearch directly doesn’t scale well when a large volume of logs comes in, and Redis is used to buffer that load. This follows how Verdi already uses Logstash to scale up delivery to Elasticsearch.

Directory structure

Client

Server


Requirements

TODO: Are these requirements meant as guides for initial (or future) development? They seem to describe design assumptions and constraints.

SDSWatch shall be scalable along with Verdi workers.

Log schema for ELK stack

SDSWatch shall monitor for line updates generated by the components being monitored.

The log file shall have the following format:

Stream metrics to ES/Kibana

  • SDSWatch shall be able to run standalone outside of a cloud vendor.

  • SDSWatch shall integrate with standard ELK stack components.

  • SDSWatch shall stream metrics to Elasticsearch.

  • Visualization of key/value of components shall be enabled via Kibana.

Stream metrics to AWS CloudWatch

SDSWatch shall stream to Amazon CloudWatch for AWS deployments.


Implementation

Main developer: @vitrandao

Code repository: https://github.com/hysds/hysds-sdswatch

Client-side Logstash

Verdi job worker

  • Verdi job worker updates logstash.conf with new job work directory’s *.sdswatch.log for each job iteration

Service container 

  • The docker run command needs to expose bindings for the Redis port and the Kibana port

  • Logstash

    • Redis input plugin to read from redis db

    • ES output plugin to save into Elasticsearch


Installation

Clone the GitHub repository to the local machine

  1. git clone https://github.com/hysds/hysds-sdswatch.git

  2. cd hysds-sdswatch

Set up a SDSWatch server

  1. Open up a new terminal instance.

  2. Connect to your remote server:

  3. Go back to your local hysds-sdswatch directory.

  4. Secure copy the onserver/sdswatch-server directory to the tmp directory on the remote server:

    Note that /export/home/hysdsops is the home directory.

  5. Go back to your remote server.

  6. Create an empty data directory to save Elasticsearch documents:

  7. Give read and write permissions for the data directory:

  8. Run the server:

Set up a SDSWatch client

Since the Mamba cluster’s Factotum has no internet connection, a Logstash (7.1.1) image needs to be imported first in order for Docker to run correctly.

  1. Open up another terminal instance.

  2. Connect to your remote client:

  3. Go back to your local hysds-sdswatch directory.

  4. Secure copy the onclient/sdswatch-client directory to the share directory on the remote client:

    Note that /export/home/hysdsops is the home directory.

  5. Go back to your remote client.

  6. Create an empty data directory to save Logstash history:

  7. Copy logstash.conf to /export/home/hysdsops/verdi/etc/:

  8. Run the client:

If there are logs to ship, you should see a Logstash output similar to the following:

Configure Supervisord to automatically start SDSWatch Client on reboot (optional)

  1. Create a supervisor.d file if it does not exist already:

  2. Add the following configuration to supervisor.d:

  3. Activate supervisor.d:

  4. Check if sdswatch-client is running correctly:

Troubleshoot


Demonstration

Instrumenting existing PGE code with SDSWatchLogger (Python)

Any metrics saved to a file named <job_type>.pge.sdswatch.log in the parent work directory are scooped up by the SDSWatchAgent running in the Verdi job worker as a background process. This means that PGE developers no longer need to configure Elasticsearch endpoints manually.

Ensure that each log file is named <job_type>.pge.sdswatch.log and placed in the same root work directory as the main module.

Download and install hysds-sdswatch via pip:

Instantiate SDSWatchLogger with SDSWatchLogger.configure_pge_logger(file_dir: str, name: str)

Log with SDSWatchLogger.log(key: str, value: str)

A custom key and its corresponding value are appended to the last two columns of the log file.
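
A minimal sketch of instrumenting a PGE step follows. The import path is an assumption (check the hysds-sdswatch repository for the actual module layout); the calls mirror the two methods listed above:

    # Minimal sketch of instrumenting a PGE with SDSWatchLogger.
    # NOTE: the import path is an assumption -- consult the
    # hysds-sdswatch repository for the actual module layout.
    from sdswatch.logger import SDSWatchLogger

    # Configure once per PGE run; the log file is written into file_dir.
    SDSWatchLogger.configure_pge_logger(file_dir=".", name="topsapp")

    # Emit key/value metrics at interesting points in the PGE; each call
    # appends the key and value as the last two columns of the log line.
    SDSWatchLogger.log("step", "coregistration")
    SDSWatchLogger.log("step", "phase unwrapping")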

Sample output


Future work

Design improvements

On the client side, Logstash is currently used instead of Filebeat and should be replaced in the future. Although Logstash is similar to Filebeat and provides log-processing capability that Filebeat doesn’t have, Filebeat is more lightweight and thus better suited to the client side. The plan is to remove Logstash from the client side and fold the current client-side Logstash configuration into the server-side Logstash configuration.

  • On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side

  • On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database. Kibana is used for visualization.

Tips for migrating the client-side Logstash to Filebeat

  • First understand how the client and server currently work by playing around with them.

  • I recommend reading all the relevant files since there are not many (you can ignore the Filebeat files on the client side for now); a good starting point is sdswatch-server.sh and sdswatch-client.sh. Try to understand the Logstash configuration file on the client side.

  • When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just add a filter block between the input block and the output block).

  • I recommend trying out Filebeat on your local machine with SDSWatch logs first. Try using Filebeat to scoop up SDSWatch logs and send them to the console. Then look at the output logs printed in the console and investigate their fields. Compare them with the input assumed by the current Logstash configuration in sdswatch-client. Experiment with Filebeat’s “add_field” feature to see whether you can make Filebeat’s output contain the required information.

  • Once you figure out how to make the output logs from Filebeat look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system and on GitHub. Use these files as a starting point. (Remember to always enable the “live reloading” feature so the configuration can be updated in production.)

  • I also wrote a configuration file to create a Docker container with Filebeat; log into hysdsops@<your-client-ip-address> and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is analogous to sdswatch-client but for Filebeat. However, when I ran it, there was an error I couldn’t resolve, which prevented me from migrating Logstash to Filebeat during my internship. The error when running Filebeat in Docker: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked about it on the Elastic forum; you may find that thread helpful).

  • The Filebeat-related files have not been tested yet, so take them with a grain of salt.

Resources

To-dos

Replace Logstash on the client side with Filebeat.
Move Logstash filtering code to the server side.
Try to find a better way to give the server-side Docker container read and write permission to /export/home/hysdsops/tmp/sdswatch-server/data.
Install SDSWatch client on all compute nodes.
Test at scale.

Related Articles:

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise


Subject Matter Expert:

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Hook Hua
