SDSWatch Metrics
Confidence Level: High. This article has been formally reviewed and is signed off on by a relevant subject matter expert.
Intro
SDSWatch is a logging mechanism for collecting insights and metrics from HySDS components and PGEs.
Use cases
Metrics collection inside PGE processing steps
The basic use case is for arbitrary PGEs to output key log metrics that SDSWatch scoops up for analytics. With a basic CSV-style log schema for PGE output, any component can emit metrics that are streamed back to the cloud for analytics, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs each write key/value CSV files on their own workers, and SDSWatch streams the metrics back to the ELK stack for analysis.
This approach enables even legacy Fortran algorithms to emit metrics that SDSWatch can scoop up and stream to aggregation for analytics.
Example SDSWatch key/value records may look like:
2020-03-03 10:26:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “coregistration”
2020-03-03 10:40:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “phase unwrapping”
Example job to instrument with SDSWatch: job-acquisition_ingest-scihub:release-20190710
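As an illustration, a minimal Python sketch of how a PGE step might append such a key/value record; the log_sdswatch_metric helper and log file name are hypothetical, and the field values mirror the example lines above:

from datetime import datetime, timezone

def log_sdswatch_metric(log_path, host, component, key, value):
    # Append one CSV-style SDSWatch record: timestamp, host, component, key, value.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write(f'{timestamp}, "{host}", "{component}", "{key}", "{value}"\n')

# Record the current processing step, mirroring the example lines above.
log_sdswatch_metric("topsapp.pge.sdswatch.log", "192.168.1.250",
                    "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121",
                    "step", "coregistration")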
Metrics collection from system crons
For some adaptations of hysds-core, periodic submission of jobs is done by Lambda functions and cron jobs. Cron jobs usually leave log files that are invisible to Metrics. By simply allowing these existing cron scripts to emit SDSWatch CSV logs, SDSWatch can monitor them for line updates and ship them to Elasticsearch for analysis.
An important use case here is the ability to capture cron script warnings or failures. Without SDSWatch, such errors would go unnoticed unless an operator logged into the system and inspected the log files.
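For illustration, a cron entry could invoke its existing script through a small wrapper that records the exit status as an SDSWatch record. This is a hedged sketch: the wrapper, host name, source id, and log path are assumptions (the path follows the full key-value convention described in the Design section below).

import subprocess
import sys
from datetime import datetime, timezone

# Assumed path, following the full key-value log convention described below.
LOG = "/home/ops/mozart/log/sdswatch/cron.fullkv.sdswatch.log"

def run_and_report(cmd, host, source_type, source_id):
    # Run the existing cron command unchanged, then append its exit status as a
    # <timestamp>, <host>, <source type>, <source id>, <key>, <value> record.
    result = subprocess.run(cmd, shell=True)
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    with open(LOG, "a") as f:
        f.write(f"{ts}, {host}, {source_type}, {source_id}, exit_status, {result.returncode}\n")
    return result.returncode

if __name__ == "__main__":
    # Hypothetical host and source id -- adapt to the cron job being wrapped.
    sys.exit(run_and_report(" ".join(sys.argv[1:]), "factotum.example.jpl.nasa.gov",
                            "cron", "daily-cleanup"))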
Metrics collection in critical HySDS components
Similar to the previous use case, but essential for monitoring HySDS core components. For example:
Mozart
orchestrator: how jobs are being routed into the Mozart queues.
process_events: how component events are streamed back from workers to Mozart ES for Figaro view.
Factotum
workers: tracking states, errors, etc.
Metrics collection on internal states of PGEs, e.g. short-circuiting existing datasets in topsApp
job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the start of the topsApp PGE, it checks whether the expected output already exists and, if so, exits immediately. In this use case, SDSWatch would emit metrics reporting dataset existence and short-circuiting.
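A hedged sketch of what that instrumentation might look like inside the PGE; the log_kv helper, log file name, and existence check below are assumptions, not the actual topsApp code:

import os
from datetime import datetime, timezone

# Hypothetical minimal key/value SDSWatch log in the PGE work directory.
SDSWATCH_LOG = "topsapp.pge.sdswatch.log"

def log_kv(key, value):
    # Append a minimal <timestamp>, <key>, <value> record.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    with open(SDSWATCH_LOG, "a") as f:
        f.write(f"'{ts}', {key}, {value}\n")

def main(expected_product_dir):
    # Stand-in for the PGE's real check for an existing S1-GUNW product.
    exists = os.path.isdir(expected_product_dir)
    log_kv("dataset_exists", str(exists).lower())
    if exists:
        # Skip the ~65-minute processing run and report the short-circuit.
        log_kv("short_circuit", "true")
        return 0
    log_kv("short_circuit", "false")
    # ... run the full topsApp processing here ...
    return 0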
Metrics analysis via Kibana
Enable visualization of key/value metrics, first aggregated across all workers and components. This allows viewing statistics of key/values such as the min/mean/max of a reported metric. Then facet down to a single compute node or worker to see metrics for just that worker, and finally facet down to a single metric to see its value reported over time.
Dashboard panels to support faceting:
keys over time
values over time
table distribution of keys
table distribution of values
table distribution of IP addresses
table distribution of component IDs
Metrics collection for Verdi events, e.g. harikiri
The Verdi job worker has many states that can be reported to SDSWatch for real-time analysis, e.g. harikiri, job states, etc. Verdi could update SDSWatch logs to enable insights into job worker events.
Design
SDSWatch client (on compute node)
The SDSWatch client monitors SDSWatch log files for updates and ships them to the SDSWatch server.
There are three SDSWatch log types that the client-side Logstash handles:
1. Key-Value SDSWatch log type
Full-schema logs for system developers working on HySDS core components (e.g., Verdi, Mozart, GRQ, etc).
Log schema:
<timestamp ISO 8601>, <host>, <source type>, <source id>, <metric key>, <metric value>
Schema design is inspired by commercial Splunk’s metrics approach. See https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview
File location and format:
/home/ops/mozart/log/sdswatch/*.fullkv.sdswatch.log
Generic logs are typically written to the log directory managed by Supervisord.
Example log:
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, state, running
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, unacked, 0
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, state, running
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, ready, 51
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, unacked, 131
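For illustration only, queue metrics like the example above could be produced by polling the RabbitMQ management API and appending one full key-value record per metric. The polling approach, credentials, and log file name here are assumptions and not necessarily how the existing collector works.

import requests
from datetime import datetime, timezone

# Management endpoint taken from the example above; log path is an assumption.
MGMT_URL = "http://e-jobs.aria.hysds.io:15672"
LOG = "/home/ops/mozart/log/sdswatch/rabbitmq.fullkv.sdswatch.log"

def collect_queue_metrics(auth=("guest", "guest")):
    # Poll the RabbitMQ management API and append one full key-value record per queue metric.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    queues = requests.get(f"{MGMT_URL}/api/queues", auth=auth, timeout=30).json()
    with open(LOG, "a") as f:
        for q in queues:
            for key, value in (("state", q.get("state")),
                               ("ready", q.get("messages_ready")),
                               ("unacked", q.get("messages_unacknowledged"))):
                f.write(f"{ts}, {MGMT_URL}, rabbitmq, {q['name']}, {key}, {value}\n")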
2. Minimal Key-Value SDSWatch log type
Simplified logs for PGE developers
Log schema:
<timestamp ISO 8601>, <key>, <value>
File location and format:
/data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log
PGE logs are typically written to the data directory of the Verdi job worker.
Example log:
'2020-05-25 01:52:40.569', key1, value1
'2020-05-25 01:52:40.570', auxiliary_key, value3
'2020-05-25 01:52:40.570', key2, value2
3. Full Dictionary SDSWatch log type
The most expressive log type, aggregating multiple key-value pairs into a single log line
Log schema:
<timestamp ISO 8601>, <key1>=<value1>, <key2>=<value2>, <keyN>=<valueN>
File location and format:
/home/ops/mozart/log/sdswatch/*.fulldict.sdswatch.log
Example log:
'2020-05-25 01:52:40.569', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=elasticsearch activestate=running activestatetimestamp=<TS>
'2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=rabbitmq-server activestate=running activestatetimestamp=<TS>
'2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=redis activestate=running activestatetimestamp=<TS>
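A minimal sketch (hypothetical helper and log file name) of emitting a full-dictionary record that follows the example above, where host, source type, and source id precede the aggregated key=value pairs:

from datetime import datetime, timezone

def log_fulldict(log_path, host, source_type, source_id, **pairs):
    # Append one full-dictionary record: all key=value pairs aggregated on a single line.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    kv = " ".join(f"{k}={v}" for k, v in pairs.items())
    with open(log_path, "a") as f:
        f.write(f"'{ts}', {host}, {source_type}, {source_id}, {kv}\n")

# Mirrors the first example line above; the file name is hypothetical.
log_fulldict("/home/ops/mozart/log/sdswatch/services.fulldict.sdswatch.log",
             "e-jobs.aria.hysds.io", "rabbitmq", "spyddder-sling-extract-asf",
             service="elasticsearch", activestate="running")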
SDSWatch server
The SDSWatch server collects and provides analytics of metrics.
Redis is used as a broker transport for delivery to Elasticsearch. Writing to Elasticsearch alone does not scale well when a large volume of logs is coming in, and Redis buffers that load. This follows how Verdi already uses Logstash to scale up delivery to Elasticsearch.
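Conceptually, the server-side pipeline behaves like the following Python sketch. The real implementation uses Logstash's Redis input and Elasticsearch output plugins; the hosts, Redis list key, and index name shown here are assumptions for illustration only.

import json
import redis
from elasticsearch import Elasticsearch

# Broker host, list key, and index name are assumptions.
r = redis.Redis(host="localhost", port=6379)
es = Elasticsearch(["http://localhost:9200"])

while True:
    # Block until a client-side shipper pushes an SDSWatch event onto the Redis list,
    # then index the event into Elasticsearch for Kibana to visualize.
    _, raw = r.blpop("sdswatch")
    event = json.loads(raw)
    es.index(index="sdswatch", body=event)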
Directory structure
Client
Server
Requirements
TODO: Are these requirements meant as guides for initial (or future) development? They seem to describe design assumptions and constraints.
SDSWatch shall be scalable along with Verdi workers.
Log schema for ELK stack
SDSWatch shall monitor for line updates in the log files generated by the components being monitored.
The log file shall have the following format:
delimiter: comma
schema: timestamp (ISO 8601), host, source type, source id, metric key, metric value
The values of the schema tokens should be quoted so that commas can appear within the quoted values.
inspired by commercial Splunk’s metrics approach. see https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview
example: 2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
Stream metrics to ES/Kibana
SDSWatch shall be able to run standalone outside of a cloud vendor.
SDSWatch shall conform to the ELK stack components.
SDSWatch shall stream metrics to Elasticsearch.
Visualization of key/value of components shall be enabled via Kibana.
Stream metrics to AWS CloudWatch
SDSWatch shall stream metrics to Amazon CloudWatch for AWS deployments.
Implementation
Main developer: @vitrandao
Code repository: https://github.com/hysds/hysds-sdswatch
Client-side Logstash
Run-time updating of logstash.conf
By default, Logstash checks for configuration changes every 3 seconds; see https://www.elastic.co/guide/en/logstash/current/reloading-config.html
Verdi updates host/conf/logstash.conf for every job iteration.
Outputs to remote Redis on SDSWatch Service
For the PGE SDSWatch log type, Filebeat extracts the schema tokens from the file path. The host can be extracted from an environment variable; see https://www.elastic.co/guide/en/beats/filebeat/current/using-environ-vars.html
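For example, the source tokens can be recovered from the path convention itself. The following is a rough Python illustration of the idea only; the actual extraction is done in the Filebeat/Logstash configuration, and the sample path is hypothetical.

import os

def tokens_from_path(path):
    # Recover <source_id> and <source_type> from
    # /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log
    source_id = os.path.basename(os.path.dirname(path))
    source_type = os.path.basename(path).split(".pge.sdswatch.log")[0]
    return source_id, source_type

# Hypothetical path for illustration.
print(tokens_from_path(
    "/data/work/jobs/2020/04/09/01/38/job-example-1234/topsapp.pge.sdswatch.log"))
# -> ('job-example-1234', 'topsapp')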
Verdi job worker
The Verdi job worker updates logstash.conf with the new job work directory's *.sdswatch.log for each job iteration.
Service container
The docker run command needs to expose port bindings for the Redis port and the Kibana port.
Logstash
Redis input plugin to read from redis db
ES output plugin to save into Elasticsearch
Installation
Clone the GitHub repository to the local machine
git clone https://github.com/hysds/hysds-sdswatch.git
cd hysds-sdswatch
Set up an SDSWatch server
1. Open up a new terminal instance.
2. Connect to your remote server:
3. Go back to your local hysds-sdswatch directory. Secure copy the onserver/sdswatch-server directory to the tmp directory on the remote server (note that /export/home/hysdsops is the home directory):
4. Go back to your remote server.
5. Create an empty data directory to save Elasticsearch documents:
6. Give read and write permission for the data directory:
7. Run the server:
Set up an SDSWatch client
Since the Mamba cluster's Factotum has no internet connection, a Logstash (7.1.1) image needs to be imported first in order for Docker to run correctly.
1. Open up another terminal instance.
2. Connect to your remote client:
3. Go back to your local hysds-sdswatch directory. Secure copy the onclient/sdswatch-server directory to the share directory on the remote client (note that /export/home/hysdsops is the home directory):
4. Go back to your remote client.
5. Create an empty data directory to save Logstash history:
6. Copy logstash.conf to /export/home/hysdsops/verdi/etc/:
7. Run the client:
If there are logs to ship, you should see a Logstash output similar to the following:
Configure Supervisord to automatically start SDSWatch Client on reboot (optional)
1. Create a supervisor.d file if it does not exist already:
2. Add the following configuration to supervisor.d:
3. Activate supervisor.d:
4. Check if sdswatch-client is running correctly:
Troubleshoot
Demonstration
Instrumenting existing PGE code with SDSWatchLogger (Python)
Any metrics saved to a file named <job_type>.pge.sdswatch.log in parent work directories are scooped up by the SDSWatchAgent running in the Verdi job worker as a background process. This means that PGE developers no longer need to configure Elasticsearch endpoints manually.
Ensure that each log file is named <job_type>.pge.sdswatch.log and placed in the same root work directory as the main module.
1. Download and install hysds-sdswatch via pip:
2. Instantiate SDSWatchLogger with SDSWatchLogger.configure_pge_logger(file_dir: str, name: str)
3. Log with SDSWatchLogger.log(key: str, value: str)
A custom key and its corresponding value are appended as the last two columns of the log file.
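A minimal usage sketch based on the two calls above; the import path and the work directory value are assumptions, so consult the hysds-sdswatch repository for the exact module layout.

# Import path is an assumption -- check the hysds-sdswatch package for the actual module name.
from sdswatch.logger import SDSWatchLogger

# Configure once per PGE run; writes <name>.pge.sdswatch.log into the given work directory.
# The directory shown here is a hypothetical job work directory.
SDSWatchLogger.configure_pge_logger(file_dir="/data/work/jobs/2020/05/25/01/52/job-example",
                                    name="topsapp")

# Each call appends the key and its value as the last two columns of the log record.
SDSWatchLogger.log("step", "coregistration")
SDSWatchLogger.log("dataset_exists", "false")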
Sample output
Future work
Design improvements
On the client side, Logstash is currently used instead of Filebeat and needs to be replaced in the future. Although Logstash provides more log processing capability than Filebeat, Filebeat is more lightweight and thus better suited to the client side. The goal is to remove Logstash from the client side and move the current client-side Logstash configuration into the server-side Logstash configuration.
On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side.
On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database. Kibana is used for visualization.
Tips for migrating the client-side Logstash to Filebeat
First, understand how the client and server currently work by playing around with them.
I recommend reading all the relevant files, since there are not many (you can ignore all the Filebeat files on the client side for now); a good starting point is sdswatch-server.sh and sdswatch-client.sh. Try to understand the Logstash configuration file on the client side.
When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just adding a filter block between input block and output block).
I recommend trying out Filebeat on your local machine with SDSWatch logs first. Use Filebeat to scoop up SDSWatch logs and send them to the console. Then look at the output logs printed to the console and investigate the fields inside them. Compare them with the input assumed by the current Logstash configuration in sdswatch-client. Try playing around with Filebeat's "add_field" feature to see whether you can make the Filebeat output logs carry the required information.
Once you figure out how to make the Filebeat output logs look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system or on GitHub. Use these files as a starting point. (Remember to always enable the "live reloading" feature so the configuration can be updated in production.)
I also already wrote a configuration file to create a Docker container with Filebeat; you may want to log into hysdsops@<your-client-ip-address> and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is similar to sdswatch-client but for Filebeat. However, when I ran it, there was an error that I could not figure out, and it prevented me from migrating Logstash to Filebeat during my internship. The error when running Filebeat in Docker was: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked about it on the Elastic forum; you may find that thread helpful).
The Filebeat-related files have not been tested yet, so take them with a grain of salt.
Resources
Logstash configuration: https://www.elastic.co/guide/en/logstash/current/configuration.html
Materials on environment variables in Logstash and Filebeat may be helpful.
Logstash directory layout on docker container: https://www.elastic.co/guide/en/logstash/current/dir-layout.html (you can find similar websites for Elasticsearch, Filebeat and Kibana)
Logstash file input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html
Logstash redis input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-redis.html
Logstash filter: https://www.elastic.co/guide/en/logstash/current/filter-plugins.html
Basic filebeat configuration: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration.html
Don't fall into these pitfalls when using Filebeat: https://logz.io/blog/filebeat-pitfalls/
Filebeat processors: https://www.elastic.co/guide/en/beats/filebeat/current/filtering-and-enhancing-data.html (similar to Logstash filters). These can be very helpful for adjusting Filebeat output logs before sending them to Logstash on the server side.
Filebeat log input: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html
Docker materials on volume mounts, environment variables, and running commands inside a container may be helpful.
Materials on granting read and write permissions to a Docker container are very helpful.
Materials on saving and loading Docker images are helpful when the compute node (e.g., Verdi, Factotum) does not have access to the internet.
To set up the Elastic Stack in Docker containers, see https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-docker.html
To-dos
/export/home/hysdsops/tmp/sdswatch-server/data
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua