SDSWatch Metrics
Confidence Level: High. This article has been formally reviewed and is signed off on by a relevant subject matter expert.
Intro
SDSWatch is a logging mechanism for collecting insights and metrics from HySDS components and PGEs.
Use cases
Metrics collection inside PGE processing steps
The basic use case is for arbitrary PGEs to output key log metrics that SDSWatch scoops up for analytics. With a basic CSV-style log schema for PGE output, any component can emit metrics that are streamed back to the cloud for analytics, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs each write key/value CSV files on their own workers, and SDSWatch streams the metrics back to the ELK stack for analysis.
This approach enables even legacy Fortran algorithms to emit metrics that SDSWatch can scoop up and stream to aggregation for analytics.
Example SDSWatch key/value records may look like:
2020-03-03 10:26:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “coregistration”
2020-03-03 10:40:00, “192.168.1.250”, “hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121”, “step”, “phase unwrapping”
Example job to instrument with SDSWatch: job-acquisition_ingest-scihub:release-20190710
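As an illustration, a minimal Python sketch of how a PGE step might append such a key/value record; the log_sdswatch_metric helper and log file name are hypothetical, and the field values mirror the example lines above:

from datetime import datetime, timezone

def log_sdswatch_metric(log_path, host, component, key, value):
    # Append one CSV-style SDSWatch record: timestamp, host, component, key, value.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write(f'{timestamp}, "{host}", "{component}", "{key}", "{value}"\n')

# Record the current processing step, mirroring the example lines above.
log_sdswatch_metric("topsapp.pge.sdswatch.log", "192.168.1.250",
                    "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121",
                    "step", "coregistration")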
Metrics collection from system crons
For some adaptations of hysds-core, periodic submission of jobs is done by Lambda functions and cron jobs. Cron jobs usually leave log files that are invisible to Metrics. By simply allowing these existing cron scripts to emit SDSWatch CSV logs, SDSWatch can monitor them for line updates and ship them to Elasticsearch for analysis.
An important use case here is the ability to capture cron script warnings or failures. Without SDSWatch, such errors would go unnoticed unless an operator logged into the system and inspected the log files.
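For illustration, a cron entry could invoke its existing script through a small wrapper that records the exit status as an SDSWatch record. This is a hedged sketch: the wrapper, host name, source id, and log path are assumptions (the path follows the full key-value convention described in the Design section below).

import subprocess
import sys
from datetime import datetime, timezone

# Assumed path, following the full key-value log convention described below.
LOG = "/home/ops/mozart/log/sdswatch/cron.fullkv.sdswatch.log"

def run_and_report(cmd, host, source_type, source_id):
    # Run the existing cron command unchanged, then append its exit status as a
    # <timestamp>, <host>, <source type>, <source id>, <key>, <value> record.
    result = subprocess.run(cmd, shell=True)
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    with open(LOG, "a") as f:
        f.write(f"{ts}, {host}, {source_type}, {source_id}, exit_status, {result.returncode}\n")
    return result.returncode

if __name__ == "__main__":
    # Hypothetical host and source id -- adapt to the cron job being wrapped.
    sys.exit(run_and_report(" ".join(sys.argv[1:]), "factotum.example.jpl.nasa.gov",
                            "cron", "daily-cleanup"))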
Metrics collection in critical HySDS components
Similar to the previous use case, but essential for monitoring HySDS core components. For example:
Mozart
orchestrator: how jobs are being routed into the Mozart queues.
process_events: how component events are streamed back from workers to Mozart ES for Figaro view.
Factotum
workers: tracking states, errors, etc.
Metrics collection on internal states of PGEs, e.g. short-circuiting existing datasets in topsApp
job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the start of the topsApp PGE, it checks whether the expected output already exists and, if so, exits immediately. In this use case, SDSWatch would emit metrics reporting dataset existence and short-circuiting.
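A hedged sketch of what that instrumentation might look like inside the PGE; the log_kv helper, log file name, and existence check below are assumptions, not the actual topsApp code:

import os
from datetime import datetime, timezone

# Hypothetical minimal key/value SDSWatch log in the PGE work directory.
SDSWATCH_LOG = "topsapp.pge.sdswatch.log"

def log_kv(key, value):
    # Append a minimal <timestamp>, <key>, <value> record.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    with open(SDSWATCH_LOG, "a") as f:
        f.write(f"'{ts}', {key}, {value}\n")

def main(expected_product_dir):
    # Stand-in for the PGE's real check for an existing S1-GUNW product.
    exists = os.path.isdir(expected_product_dir)
    log_kv("dataset_exists", str(exists).lower())
    if exists:
        # Skip the ~65-minute processing run and report the short-circuit.
        log_kv("short_circuit", "true")
        return 0
    log_kv("short_circuit", "false")
    # ... run the full topsApp processing here ...
    return 0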
Metrics analysis via Kibana
Enable visualization of key/value metrics, first aggregated across all workers and components. This allows viewing statistics of key/values such as the min/mean/max of a reported metric. Then facet down to a single compute node or worker to see metrics for just that worker, and finally facet down to a single metric to see its value reported over time.
Dashboard panels to support faceting:
keys over time
values over time
table distribution of keys
table distribution of values
table distribution of IP addresses
table distribution of component IDs
Metrics collection for Verdi events, e.g. harikiri
The Verdi job worker has many states that can be reported to SDSWatch for real-time analysis, e.g. harikiri, job states, etc. Verdi could update SDSWatch logs to enable insights into job worker events.
Design
SDSWatch client (on compute node)
The SDSWatch client monitors SDSWatch log files for updates and ships them to the SDSWatch server.
There are three SDSWatch log types that the client-side Logstash handles:
1. Key-Value SDSWatch log type
Full-schema logs for system developers working on HySDS core components (e.g., Verdi, Mozart, GRQ, etc).
Log schema:
<timestamp ISO 8601>, <host>, <source type>, <source id>, <metric key>, <metric value>
Schema design is inspired by commercial Splunk’s metrics approach. See https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview
File location and format:
/home/ops/mozart/log/sdswatch/*.fullkv.sdswatch.log
Generic logs are typically written to the log directory managed by Supervisord.
Example log:
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, state, running
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, unacked, 0
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, state, running
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, ready, 51
2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, standard_product-s1gunw-topsapp-pleiades, unacked, 131
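For illustration only, queue metrics like the example above could be produced by polling the RabbitMQ management API and appending one full key-value record per metric. The polling approach, credentials, and log file name here are assumptions and not necessarily how the existing collector works.

import requests
from datetime import datetime, timezone

# Management endpoint taken from the example above; log path is an assumption.
MGMT_URL = "http://e-jobs.aria.hysds.io:15672"
LOG = "/home/ops/mozart/log/sdswatch/rabbitmq.fullkv.sdswatch.log"

def collect_queue_metrics(auth=("guest", "guest")):
    # Poll the RabbitMQ management API and append one full key-value record per queue metric.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    queues = requests.get(f"{MGMT_URL}/api/queues", auth=auth, timeout=30).json()
    with open(LOG, "a") as f:
        for q in queues:
            for key, value in (("state", q.get("state")),
                               ("ready", q.get("messages_ready")),
                               ("unacked", q.get("messages_unacknowledged"))):
                f.write(f"{ts}, {MGMT_URL}, rabbitmq, {q['name']}, {key}, {value}\n")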
2. Minimal Key-Value SDSWatch log type
Simplified logs for PGE developers
Log schema:
<timestamp ISO 8601>, <key>, <value>
File location and format:
/data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log
PGE logs are typically written to the data directory of the Verdi job worker.
Example log:
'2020-05-25 01:52:40.569', key1, value1
'2020-05-25 01:52:40.570', auxiliary_key, value3
'2020-05-25 01:52:40.570', key2, value2
3. Full Dictionary SDSWatch log type
The most expressive log type, aggregating multiple key-value pairs into a single log line
Log schema:
<timestamp ISO 8601>, <key1>=<value1>, <key2>=<value2>, <keyN>=<valueN>
File location and format:
/home/ops/mozart/log/sdswatch/*.fulldict.sdswatch.log
Example log:
'2020-05-25 01:52:40.569', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=elasticsearch activestate=running activestatetimestamp=<TS>
'2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=rabbitmq-server activestate=running activestatetimestamp=<TS>
'2020-05-25 01:52:40.570', e-jobs.aria.hysds.io, rabbitmq, spyddder-sling-extract-asf, service=redis activestate=running activestatetimestamp=<TS>
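A minimal sketch (hypothetical helper and log file name) of emitting a full-dictionary record that follows the example above, where host, source type, and source id precede the aggregated key=value pairs:

from datetime import datetime, timezone

def log_fulldict(log_path, host, source_type, source_id, **pairs):
    # Append one full-dictionary record: all key=value pairs aggregated on a single line.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    kv = " ".join(f"{k}={v}" for k, v in pairs.items())
    with open(log_path, "a") as f:
        f.write(f"'{ts}', {host}, {source_type}, {source_id}, {kv}\n")

# Mirrors the first example line above; the file name is hypothetical.
log_fulldict("/home/ops/mozart/log/sdswatch/services.fulldict.sdswatch.log",
             "e-jobs.aria.hysds.io", "rabbitmq", "spyddder-sling-extract-asf",
             service="elasticsearch", activestate="running")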
SDSWatch server
The SDSWatch server collects and provides analytics of metrics.
Redis is used as a broker transport for delivery to Elasticsearch. Writing to Elasticsearch alone does not scale well when a large volume of logs is coming in, and Redis buffers that load. This follows how Verdi already uses Logstash to scale up delivery to Elasticsearch.
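Conceptually, the server-side pipeline behaves like the following Python sketch. The real implementation uses Logstash's Redis input and Elasticsearch output plugins; the hosts, Redis list key, and index name shown here are assumptions for illustration only.

import json
import redis
from elasticsearch import Elasticsearch

# Broker host, list key, and index name are assumptions.
r = redis.Redis(host="localhost", port=6379)
es = Elasticsearch(["http://localhost:9200"])

while True:
    # Block until a client-side shipper pushes an SDSWatch event onto the Redis list,
    # then index the event into Elasticsearch for Kibana to visualize.
    _, raw = r.blpop("sdswatch")
    event = json.loads(raw)
    es.index(index="sdswatch", body=event)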
Directory structure
Client
Server
Requirements
TODO: Are these requirements meant as guides for initial (or future) development? They seem to describe design assumptions and constraints.
SDSWatch shall be scalable along with Verdi workers.
Log schema for ELK stack
SDSWatch shall monitor for line updates in the log files generated by the components being monitored.
The log file shall have the following format:
delimiter: comma
schema: timestamp (ISO 8601), host, source type, source id, metric key, metric value
The values of the schema tokens should be quoted so that commas can appear within the quoted values.
inspired by commercial Splunk’s metrics approach. see https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview
example: 2020-04-09T01:38:22+0000, http://e-jobs.aria.hysds.io:15672, rabbitmq, spyddder-sling-extract-asf, ready, 1
Stream metrics to ES/Kibana
SDSWatch shall be able to run standalone outside of a cloud vendor.
SDSWatch shall conform to the ELK stack components.
SDSWatch shall stream metrics to Elasticsearch.
Visualization of key/value of components shall be enabled via Kibana.
Stream metrics to AWS CloudWatch
SDSWatch shall stream metrics to Amazon CloudWatch for AWS deployments.
Implementation
Main developer: @vitrandao
Code repository: https://github.com/hysds/hysds-sdswatch
Client-side Logstash
Run-time updating of logstash.conf
By default, Logstash checks for configuration changes every 3 seconds; see https://www.elastic.co/guide/en/logstash/current/reloading-config.html
Verdi updates host/conf/logstash.conf for every job iteration.
Outputs to remote Redis on SDSWatch Service
For the PGE SDSWatch log type, Filebeat extracts the schema tokens from the file path. The host can be extracted from an environment variable; see https://www.elastic.co/guide/en/beats/filebeat/current/using-environ-vars.html
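For example, the source tokens can be recovered from the path convention itself. The following is a rough Python illustration of the idea only; the actual extraction is done in the Filebeat/Logstash configuration, and the sample path is hypothetical.

import os

def tokens_from_path(path):
    # Recover <source_id> and <source_type> from
    # /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log
    source_id = os.path.basename(os.path.dirname(path))
    source_type = os.path.basename(path).split(".pge.sdswatch.log")[0]
    return source_id, source_type

# Hypothetical path for illustration.
print(tokens_from_path(
    "/data/work/jobs/2020/04/09/01/38/job-example-1234/topsapp.pge.sdswatch.log"))
# -> ('job-example-1234', 'topsapp')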
Verdi job worker
The Verdi job worker updates logstash.conf with the new job work directory's *.sdswatch.log for each job iteration.
Service container
The docker run command needs to expose port bindings for the Redis port and the Kibana port.
Logstash
Redis input plugin to read from redis db
ES output plugin to save into Elasticsearch
Installation
Clone the GitHub repository to the local machine
git clone https://github.com/hysds/hysds-sdswatch.git
cd hysds-sdswatch
Set up an SDSWatch server
1. Open up a new terminal instance.
2. Connect to your remote server:
3. Go back to your local hysds-sdswatch directory. Secure copy the onserver/sdswatch-server directory to the tmp directory on the remote server (note that /export/home/hysdsops is the home directory):
4. Go back to your remote server.
5. Create an empty data directory to save Elasticsearch documents:
6. Give read and write permission for the data directory:
7. Run the server:
Set up an SDSWatch client
Since the Mamba cluster's Factotum has no internet connection, a Logstash (7.1.1) image needs to be imported first in order for Docker to run correctly.
1. Open up another terminal instance.
2. Connect to your remote client:
3. Go back to your local hysds-sdswatch directory. Secure copy the onclient/sdswatch-server directory to the share directory on the remote client (note that /export/home/hysdsops is the home directory):
4. Go back to your remote client.
5. Create an empty data directory to save Logstash history:
6. Copy logstash.conf to /export/home/hysdsops/verdi/etc/:
7. Run the client:
If there are logs to ship, you should see a Logstash output similar to the following:
Configure Supervisord to automatically start SDSWatch Client on reboot (optional)
1. Create a supervisor.d file if it does not exist already:
2. Add the following configuration to supervisor.d:
3. Activate supervisor.d:
4. Check if sdswatch-client is running correctly:
Troubleshoot
Demonstration
Instrumenting existing PGE code with SDSWatchLogger (Python)
Any metrics saved to a file named <job_type>.pge.sdswatch.log in parent work directories are scooped up by the SDSWatchAgent running in the Verdi job worker as a background process. This means that PGE developers no longer need to configure Elasticsearch endpoints manually.
Ensure that each log file is named <job_type>.pge.sdswatch.log and placed in the same root work directory as the main module.
1. Download and install hysds-sdswatch via pip:
2. Instantiate SDSWatchLogger with SDSWatchLogger.configure_pge_logger(file_dir: str, name: str)
3. Log with SDSWatchLogger.log(key: str, value: str)
A custom key and its corresponding value are appended as the last two columns of the log file.
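A minimal usage sketch based on the two calls above; the import path and the work directory value are assumptions, so consult the hysds-sdswatch repository for the exact module layout.

# Import path is an assumption -- check the hysds-sdswatch package for the actual module name.
from sdswatch.logger import SDSWatchLogger

# Configure once per PGE run; writes <name>.pge.sdswatch.log into the given work directory.
# The directory shown here is a hypothetical job work directory.
SDSWatchLogger.configure_pge_logger(file_dir="/data/work/jobs/2020/05/25/01/52/job-example",
                                    name="topsapp")

# Each call appends the key and its value as the last two columns of the log record.
SDSWatchLogger.log("step", "coregistration")
SDSWatchLogger.log("dataset_exists", "false")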
Sample output
Future work
Design improvements
On the client side, Logstash is currently used instead of Filebeat and needs to be replaced in the future. Although Logstash provides more log processing capability than Filebeat, Filebeat is more lightweight and thus better suited to the client side. The goal is to remove Logstash from the client side and move the current client-side Logstash configuration into the server-side Logstash configuration.
On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side.
On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database. Kibana is used for visualization.
Tips for migrating the client-side Logstash to Filebeat
First, understand how the client and server currently work by playing around with them.
I recommend reading all the relevant files, since there are not many (you can ignore all the Filebeat files on the client side for now); a good starting point is sdswatch-server.sh and sdswatch-client.sh. Try to understand the Logstash configuration file on the client side.
When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just adding a filter block between input block and output block).
I recommend trying out Filebeat on your local machine with SDSWatch logs first. Use Filebeat to scoop up SDSWatch logs and send them to the console. Then look at the output logs printed to the console and investigate the fields inside them. Compare them with the input assumed by the current Logstash configuration in sdswatch-client. Try playing around with Filebeat's "add_field" feature to see whether you can make the Filebeat output logs carry the required information.
Once you figure out how to make the Filebeat output logs look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system or on GitHub. Use these files as a starting point. (Remember to always enable the "live reloading" feature so the configuration can be updated in production.)
I also already wrote a configuration file to create a Docker container with Filebeat; you may want to log into hysdsops@<your-client-ip-address> and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is similar to sdswatch-client but for Filebeat. However, when I ran it, there was an error that I could not figure out, and it prevented me from migrating Logstash to Filebeat during my internship. The error when running Filebeat in Docker was: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked about it on the Elastic forum; you may find that thread helpful).
The Filebeat-related files have not been tested yet, so take them with a grain of salt.
Resources
Logstash configuration: https://www.elastic.co/guide/en/logstash/current/configuration.html
Materials on environment variables in Logstash and Filebeat may be helpful.
Logstash directory layout on docker container: https://www.elastic.co/guide/en/logstash/current/dir-layout.html (you can find similar websites for Elasticsearch, Filebeat and Kibana)
Logstash file input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html
Logstash redis input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-redis.html
Logstash filter: https://www.elastic.co/guide/en/logstash/current/filter-plugins.html
Basic filebeat configuration: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration.html
Don't fall into these pitfalls when using Filebeat: https://logz.io/blog/filebeat-pitfalls/
Filebeat processors: https://www.elastic.co/guide/en/beats/filebeat/current/filtering-and-enhancing-data.html (similar to Logstash filters). These can be very helpful for adjusting Filebeat output logs before sending them to Logstash on the server side.
Filebeat log input: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html
Docker materials on volume mounts, environment variables, and running commands inside a container may be helpful.
Materials on granting read and write permissions to a Docker container are very helpful.
Materials on saving and loading Docker images are helpful when the compute node (e.g., Verdi, Factotum) does not have access to the internet.
To set up the Elastic Stack in Docker containers, see https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-docker.html
To-dos
/export/home/hysdsops/tmp/sdswatch-server/data
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua