Confidence Level: High. This article has been formally reviewed and is signed off on by a relevant subject matter expert.

Intro

SDSWatch is a logging mechanism for collecting insights and metrics.


Use cases

Metrics collection inside PGE processing steps

The basic use case is for arbitrary PGEs to output key log metrics that SDSWatch scoops up for analytics. With a basic schema for CSV-style log files output by PGEs, any component can emit metrics that are streamed back for analytics in the cloud, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs could each output key/value pairs in CSV files on their own workers, and SDSWatch streams the metrics back to the ELK stack for analysis.

This approach enables even legacy Fortran algorithms to emit metrics that SDSWatch can scoop up and stream to an aggregation point for analytics.

Example SDSWatch key/value entries:

  • 2020-03-03 10:26:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “coregistration”

  • 2020-03-03 10:40:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “phase unwrapping”

Example job to instrument with SDSWatch: job-acquisition_ingest-scihub:release-20190710
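
Because the schema is just a CSV line appended to a log file, instrumentation can be as small as a file append. Below is a minimal sketch (not part of hysds-sdswatch; the log file name, component ID, and helper are illustrative) of how a PGE wrapper, even one around a legacy algorithm, might emit the example entries above:

Code Block
# sdswatch_emit.py - illustrative sketch; file name, component id, and keys are assumptions
import socket
from datetime import datetime, timezone

LOG_PATH = "topsapp.pge.sdswatch.log"
COMPONENT_ID = "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121"

def emit_metric(key, value):
    """Append one CSV-style SDSWatch line; the SDSWatch client tails the file and ships it."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    host = socket.gethostbyname(socket.gethostname())
    with open(LOG_PATH, "a") as f:
        f.write(f'{timestamp}, "{host}", "{COMPONENT_ID}", "{key}", "{value}"\n')

emit_metric("step", "coregistration")
emit_metric("step", "phase unwrapping")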

Metrics collection from system crons

For some adaptations of hysds-core, periodic submission of jobs is done by Lambdas and crons. Crons usually leave log files that are invisible to Metrics. By simply allowing these old cron scripts to emit SDSWatch CSV logs, SDSWatch can monitor for line updates and ship them to Elasticsearch for analysis.

An important use case here is capturing cron script warnings or failures. Without SDSWatch, such errors would go unnoticed unless someone logged into the system. A sketch of a cron wrapper that makes them visible is shown below.
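
A hedged sketch of such a cron wrapper (the task script path, log file name, and metric keys are assumptions, following the full key-value schema described under Design):

Code Block
# cron_sdswatch_wrapper.py - illustrative sketch; paths and keys are assumptions
import socket
import subprocess
from datetime import datetime, timezone

LOG_PATH = "/home/ops/mozart/log/sdswatch/nightly-cron.fullkv.sdswatch.log"

def log_kv(key, value):
    # full key-value schema: <timestamp>, <host>, <source type>, <source id>, <key>, <value>
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+0000")
    with open(LOG_PATH, "a") as f:
        f.write(f"{ts}, {socket.gethostname()}, cron, nightly-cron, {key}, {value}\n")

# run the existing cron task and record its outcome so failures become visible in Metrics
result = subprocess.run(["/path/to/nightly_cron_task.sh"], capture_output=True)
log_kv("exit_code", str(result.returncode))
log_kv("status", "ok" if result.returncode == 0 else "failed")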

Metrics collection in critical HySDS components

Similar to the previous use case, but essential for monitoring HySDS core components (e.g., Verdi, Mozart, GRQ).

Metrics collection on internal states of PGEs, e.g. short-circuiting existing datasets in topsApp

job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the beginning of the topsApp PGE, it checks whether the expected output already exists; if so, it exits immediately. For this use case, SDSWatch would emit metrics reporting dataset existence and the short-circuit.
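
A minimal sketch of that instrumentation, using the SDSWatchLogger API shown under Demonstration below (the job directory, logger name, and metric keys are illustrative assumptions):

Code Block
# topsapp_short_circuit_example.py - illustrative; keys and paths are assumptions
import os
from sdswatch.sdswatchlogger import SDSWatchLogger as sdsw_logger

sdsw_logger.configure_pge_logger("/path/to/job/dir", "topsapp")

expected_output = "/path/to/job/dir/expected_s1_gunw_product"
if os.path.exists(expected_output):
    # record that the PGE short-circuited because the dataset already exists
    sdsw_logger.log("dataset_exists", "true")
    sdsw_logger.log("short_circuit", "true")
    raise SystemExit(0)

sdsw_logger.log("dataset_exists", "false")
# ... continue with normal topsApp processing ...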

Metrics analysis via Kibana

Enable visualization of key/value metrics, first aggregated across all workers and components. This enables viewing statistics of key/values such as the min/mean/max of a reported metric. Then facet into one compute node or worker to see metrics just for that worker, and finally facet onto a single metric to see its value reported over time.

Dashboard panels to support faceting:

  • keys over time

  • values over time

  • table distribution of keys

  • table distribution of values

  • table distribution of IP addresses

  • table distribution of component IDs

Metrics collection for Verdi events, e.g. Hariki

The Verdi job worker has many states that can be reported to SDSWatch for analysis in real time, e.g. hariki, job states, etc. Verdi could update SDSWatch logs to enable insights into job worker events.
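
For instance, Verdi might append full key-value SDSWatch lines such as the following (hypothetical host, job ID, and values):

Code Block
2020-04-09T01:38:22+0000, <verdi-worker-host>, verdi, <job-id>, state, job-started
2020-04-09T01:42:10+0000, <verdi-worker-host>, verdi, <job-id>, state, job-completed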


Design

SDSWatch client (on compute node)

The SDSWatch client monitors SDSWatch logs for new lines and ships them to the server.

There are three SDSWatch log types that the client-side Logstash handles:

1. Key-Value SDSWatch log type

  • Full-schema logs for system developers working on HySDS core components (e.g., Verdi, Mozart, GRQ).

  • Log schema: <timestamp ISO 8601>, <host>, <source type>, <source id>, <metric key>, <metric value>

  • File location and format: /home/ops/mozart/log/sdswatch/*.fullkv.sdswatch.log

    • Generic logs are typically in the log directory managed by Supervisord

  • Example log:

    Code Block
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, state, running
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, unacked, 0
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, state, running
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, ready, 51
    2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, unacked, 131

2. Minimal Key-Value SDSWatch log type

  • Simplified logs for PGE developers

  • Log schema: <timestamp ISO 8601>, <key>, <value>

  • File location and format: /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log

    • PGE logs are typically in the data directory of the Verdi job worker.

  • Example log:

    Code Block
    '2020-05-25 01:52:40.569', key1, value1
    '2020-05-25 01:52:40.570', auxiliary_key, value3
    '2020-05-25 01:52:40.570', key2, value2

3. Full Dictionary SDSWatch log type

  • Most powerful log format, aggregating multiple key-value pairs into a single log line

  • Log schema: <timestamp ISO 8601>, <host>, <source type>, <source id>, <key1>=<value1> <key2>=<value2> ... <keyN>=<valueN>

  • File location and format: /home/ops/mozart/log/sdswatch/*.fulldict.sdswatch.log

  • Example log:

    Code Block
    '2020-05-25 01:52:40.569', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=elasticsearch activestate=running activestatetimestamp=<TS>
    '2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=rabbitmq-server activestate=running activestatetimestamp=<TS>
    '2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=redis activestate=running activestatetimestamp=<TS>

SDSWatch server

The SDSWatch server collects metrics and provides analytics on them.

Redis is leveraged as the broker transport for delivery to Elasticsearch. Elasticsearch alone does not scale well when a large volume of logs comes in, and Redis is used to solve that problem. This follows how Verdi already uses Logstash to scale up delivery to Elasticsearch.

Directory structure

Client

Code Block
verdi/
  share/
    sdswatch-client/
      data/
      sdswatch-client.sh
  etc/
    logstash.conf
    filebeat.yml
    filebeat-configs/

Server

Code Block
tmp/
  sdswatch-server/
    data/
    conf/
      logstash.conf
    load-kibana-dashboard/
      Dockerfile
      load-kibana-dashboard.sh
      sdswatch-dashboard.json
      wait-for.it.sh
    docker-compose.yml
    sdswatch-server.sh
    .env

Requirements

Info

TODO: Are these requirements meant as guides for initial (or future) development? They seem to describe design assumptions and constraints.

SDSWatch shall be scalable along with Verdi workers.

Log schema for ELK stack

SDSWatch shall monitor for line updates in the log files generated by the components being monitored.

The log file shall have the following format:

  • delimiter: comma

  • schema: timestamp (ISO 8601), host, source type, source id, metric key, metric value

    • the values of the schema tokens should be quoted to allow for commas within the quotes.

  • Inspired by the metrics approach of commercial Splunk; see https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview

  • example:

    Code Block
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , state, running
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , ready, 1
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , unacked, 0
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , state, running
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , ready, 51
    2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , unacked, 131
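
As an illustration of the quoting rule above, a metric value that itself contains commas would be wrapped in quotes (hypothetical line):

Code Block
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , consumers, "worker-1,worker-2,worker-3"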

Stream metrics to ES/Kibana

  • SDSWatch shall be able to run standalone outside of a cloud vendor.

  • SDSWatch shall conform to the ELK stack components.

  • SDSWatch shall stream metrics to Elasticsearch.

  • Visualization of key/value of components shall be enabled via Kibana.

Stream metrics to AWS CloudWatch

SDSWatch shall stream metrics to Amazon CloudWatch for AWS deployments.


Implementation

Main developer: vitrandao

Code repository: https://github.com/hysds/hysds-sdswatch

Client-side Logstash

Verdi job worker

  • The Verdi job worker updates logstash.conf with the new job work directory’s *.sdswatch.log for each job iteration

Service container 

  • Docker run needs to expose -v bindings for the redis port and kibana port

  • Logstash

    • Redis input plugin to read from redis db

    • ES output plugin to save into Elasticsearch
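
A minimal sketch of what that service-container Logstash pipeline could look like (the Redis list key, host names, and index pattern here are assumptions, not the shipped configuration):

Code Block
input {
  redis {
    host => "redis"
    port => 6379
    data_type => "list"
    key => "sdswatch"                       # assumed Redis list key
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]  # assumed Elasticsearch endpoint
    index => "sdswatch-%{+YYYY.MM.dd}"      # assumed index pattern
  }
}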


Installation

Clone the GitHub repository to the local machine

  1. git clone https://github.com/hysds/hysds-sdswatch.git

  2. cd hysds-sdswatch

Set up an SDSWatch server

  1. Open up a new terminal instance.

  2. Connect to your remote server:

    Code Block
    ssh -i <path/to/identity_key_file> hysdsops@<server-ip-address>
  3. Go back to your local hysds-sdswatch directory.

  4. Secure copy the onserver/sdswatch-server directory to the tmp directory on the remote server:

    Code Block
    scp -i <path/to/identity_key_file> -r onserver hysdsops@<server-ip-address>:/export/home/hysdsops/tmp/

    Note that /export/home/hysdsops is the home directory.

  5. Go back to your remote server.

  6. Create an empty data directory to save Elasticsearch documents:

    Code Block
    mkdir -p ~/tmp/sdswatch-server/data
  7. Give a read and write permission for the data directory:

    Code Block
    chmod 777 -R ~/tmp/sdswatch-server/data
  8. Run the server:

    Code Block
    sh ~/tmp/sdswatch-server/sdswatch-server.sh

Set up an SDSWatch client

Note

Since Mamba cluster’s Factotum has no internet connection, a Logstash (7.1.1) image needs to be imported first, in order for Docker to run correctly. 

  1. Open up another terminal instance.

  2. Connect to your remote client:

    Code Block
    ssh -i <path/to/identity_key_file> hysdsops@<client-ip-address>
  3. Go back to your local hysds-sdswatch directory.

  4. Secure copy the onclient/sdswatch-client directory to the share directory on the remote client:

    Code Block
    scp -i <path/to/identity_key_file> -r onclient hysdsops@<client-ip-address>:/export/home/hysdsops/verdi/share/

    Note that /export/home/hysdsops is the home directory.

  5. Go back to your remote client.

  6. Create an empty data directory to save Logstash history:

    Code Block
    mkdir -p ~/verdi/share/sdswatch-client/data
  7. Copy logstash.conf to /export/home/hysdsops/verdi/etc/:

    Code Block
    cp ~/verdi/share/sdswatch-client/logstash.conf /export/home/hysdsops/verdi/etc/
  8. Run the client:

    Code Block
    sh ~/verdi/share/sdswatch-client/sdswatch-client.sh

If there are logs to ship, you should see a Logstash output similar to the following:

Code Block
{
  "source_id" => "user_rules_job",
  "host" => "https://<rabbitmq-ip-address>:15673",
  "metric_key" => "state",
  "log_path" => "/verdi/rabbitmq_queue_monitor_to_sdswatch-00.sdswatch.log",
  "sdswatch_timestamp" => 2020-05-29T02:30:18.000Z,
  "metric_value_float" => -1,
  "@version" => "1",
  "metric_value_string" => "running",
  "message" => "2020-05-29T02:30:18+00:00 , https://<rabbitmq-ip-address>:15673 , rabbitmq.queue , user_rules_job , state, running",
  "source_type" => "rabbitmq.queue"
}

Configure Supervisord to automatically start SDSWatch Client on reboot (optional)

  1. Create a supervisor.d file if it does not exist already:

    Code Block
    vi ~/verdi/etc/supervisor.d
  2. Add the following configuration to supervisor.d:

    Code Block
    [program:sdswatch-client]
    directory=/export/home/hysdsops/verdi/share/sdswatch-client/
    command=/export/home/hysdsops/verdi/share/sdswatch-client/sdswatch-client.sh
    process_name=%(program_name)s-%(process_num)02d
    priority=1
    numprocs=1
    numprocs_start=0
    redirect_stderr=true
    stdout_logfile=%(here)s/../log/%(program_name)s-%(process_num)02d.log
    stdout_logfile_maxbytes=10MB
    stdout_logfile_backups=10
    startsecs=10

  3. Activate supervisor.d:

    Code Block
    supervisorctl start sdswatch-client:sdswatch-client-00

  4. Check if sdswatch-client is running correctly:

    Code Block
    supervisorctl status
    supervisorctl reread
    supervisorctl update

Troubleshoot

Tail the SDSWatch client log:

Code Block
tail -f /export/home/hysdsops/verdi/log/sdswatch-client-00.log

Demonstration

Instrumenting existing PGE code with SDSWatchLogger (Python)

Any metrics saved to a file named <job_type>.pge.sdswatch.log in the job work directories are scooped up by the SDSWatchAgent running in the Verdi job worker as a background process. This means PGE developers no longer need to configure Elasticsearch endpoints manually.

Note

Ensure that each log file is named <job_type>.pge.sdswatch.log and placed in the same root work directory as the main module.

Code Block
/data/work/jobs/2020/03/17/07/35/job1/download_type.pge.sdswatch.log
/data/work/jobs/2020/03/17/07/35/job2/processing_type.pge.sdswatch.log

Download and install hysds-sdswatch via pip:

Code Block
pip3 install git+https://github.com/hysds/hysds-sdswatch.git@master
Info

TODO: Is global installation with pip a recommended way to manage dependencies in a system?

Instantiate SDSWatchLogger with SDSWatchLogger.configure_pge_logger(file_dir: str, name: str)

Code Block
# example_main_module.py
from sdswatch.sdswatchlogger import SDSWatchLogger as sdsw_logger

sdsw_logger.configure_pge_logger("/path/to/job/dir", "example_hello_world")

Log with SDSWatchLogger.log(key: str, value: str)

A custom key and its corresponding value are appended as the last two columns of the log file.

Code Block
# example_auxiliary_module.py
from sdswatch.sdswatchlogger import SDSWatchLogger as sdsw_logger

def download():
  sdsw_logger.log("auxiliary_key", "value3")
Code Block
# example_main_module.py
from sdswatch.sdswatchlogger import SDSWatchLogger as sdsw_logger
from example_auxiliary_module import download

sdsw_logger.configure_pge_logger("/path/to/job/dir", "example_hello_world")

if __name__ == "__main__":
  sdsw_logger.log("key1", "value1")
  download()
  sdsw_logger.log("key2", "value2")

Sample output

Code Block
# example_hello_world.pge.sdswatch.log
'2020-05-25 01:52:40.569',key1,value1
'2020-05-25 01:52:40.570',auxiliary_key,value3
'2020-05-25 01:52:40.570',key2,value2

Future work

Design improvements

On the client side, Logstash is currently used instead of Filebeat and should be replaced in the future. Although Logstash provides log processing capabilities that Filebeat does not, Filebeat is more lightweight and therefore better suited to the client side. The plan is to remove Logstash from the client side and fold the current client-side Logstash configuration into the server-side Logstash configuration.

  • On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side

  • On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database; Kibana is used for visualization.

Tips for migrating the client-side Logstash to Filebeat

  • First, understand how the client and server currently work by playing around with them.

  • I recommend reading all the relevant files, since there are not many (ignore all the Filebeat files on the client side for now); a good starting point is sdswatch-server.sh and sdswatch-client.sh. Try to understand the Logstash configuration file on the client side.

  • When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just add a filter block between the input block and the output block).

  • I recommend trying out Filebeat on your local machine with SDSWatch logs first. Use Filebeat to scoop up SDSWatch logs and send them to the console, look at the output logs printed in the console, and investigate the fields inside them. Then compare them with the assumed input in the current Logstash configuration in sdswatch-client. Play around with Filebeat’s “add_field” feature and see if you can make the output logs from Filebeat carry the required information.

  • Once you figure out how to make the output logs from Filebeat look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system and on GitHub. Use these files as a starting point. (Remember to enable the “live reloading” feature so the configuration can always be updated in production.)

  • I also wrote a configuration file to create a Docker container with Filebeat; log into hysdsops@<your-client-ip-address> and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is similar to sdswatch-client, but for Filebeat. However, when I ran it there was an error I could not resolve, and this prevented me from migrating Logstash to Filebeat during my internship. The error when running Filebeat in Docker was: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked about it on the Elastic forum; you may find that thread helpful).

  • The Filebeat-related files have not been tested yet, so take them with a grain of salt.

Resources

To-dos

  •  Replace Logstash on the client side with Filebeat.
  •  Move Logstash filtering code to the server side.
  •  Find a better way to grant read and write permissions so the server-side Docker container can write into /export/home/hysdsops/tmp/sdswatch-server/data.
  •  Install SDSWatch client on all compute nodes.
  •  Test at scale.

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise
