SDSWatch

SDSWatch is designed to give insight into what compute nodes are doing in real time. It uses Kibana dashboards and visualizations, with their aggregation and faceting capabilities, to achieve that purpose.

Content:

What are potential use cases of SDSWatch?
How to use SDSWatch as a client?
How to emit SDSWatch logs in Python with the SDSWatchLogger package? (optional)
Guide for future development of SDSWatch.

What are potential use cases of SDSWatch?

Use case 1: metrics collection from inside PGE processing steps

The basic use case is for arbitrary PGEs to output key log metrics for sdswatch to scoop up for analytics. With a basic schema for csv-style log files output by PGEs, any component can output metrics that are streamed back for analytics in the cloud, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs can each output key/value pairs in csv files on their own workers, and sdswatch streams the metrics back to the ELK stack for analysis.

This approach enables even legacy Fortran algorithms to emit metrics that can be scooped up by sdswatch and streamed to aggregation for analytics.

Example sdswatch key/value lines may include:

2020-03-03 10:26:00, "192.168.1.250", "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121", "step", "coregistration"

2020-03-03 10:40:00, "192.168.1.250", "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121", "step", "phase unwrapping"

Example job to instrument with sdswatch: job-acquisition_ingest-scihub:release-20190710

Use case 2: metrics collection from system crons

For some adaptations of hysds-core, periodic submission of jobs is done by lambda and crons. Crons usually leave log files that are invisible to Metrics. By simply allowing these old cron scripts to emit sdswatch csv logs, sdswatch should be able to monitor for line updates and ship them to Elasticsearch for analysis.

An important use case here is the ability to capture cron script warnings or failures. Without sdswatch, errors would go unnoticed unless someone logged into the system.
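As a minimal sketch of this use case, a cron entry could call a small wrapper that runs the real command and appends its outcome to an sdswatch log. It borrows the generic line format described later on this page; the "cron" source type, the "state" and "exit_code" keys, and the file name are hypothetical choices for illustration, not SDSWatch conventions.

```python
import socket
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def emit(log_path, source_type, source_id, key, value):
    """Append one line: <timestamp>, <host>, <source_type>, <source_id>, <key>, <value>."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
    fields = [timestamp, socket.gethostname(), source_type, source_id, key, str(value)]
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(", ".join(fields) + "\n")


def run_cron_command(command, log_path, source_id):
    """Run the cron command and record whether it succeeded or failed."""
    emit(log_path, "cron", source_id, "state", "started")
    result = subprocess.run(command, capture_output=True, text=True)
    state = "succeeded" if result.returncode == 0 else "failed"
    emit(log_path, "cron", source_id, "state", state)
    emit(log_path, "cron", source_id, "exit_code", result.returncode)
    return result.returncode


# Demo: a command that fails with exit code 3, logged to a temporary file.
demo_log = Path(tempfile.mkdtemp()) / "example_cron.sdswatch.log"
exit_code = run_cron_command(
    [sys.executable, "-c", "raise SystemExit(3)"], demo_log, "example_cron"
)
logged_lines = demo_log.read_text(encoding="utf-8").splitlines()
print("\n".join(logged_lines))
```

Because the wrapper only appends lines to a file, the cron script itself needs no knowledge of Elasticsearch or Kibana.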

Use case 3: metrics collection in critical hysds components

Similar to use case 2, but essential for monitoring key hysds core components. For example:

mozart
- orchestrator: how jobs are being routed into the Mozart queues.
- process_events: how component events are streamed back from workers to mozart/ES for figaro view.
factotum
- workers in factotum: tracking states, errors, etc.

Use case 4: metrics collection on non-error job states e.g. topsApp short-circuiting existing datasets

job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the beginning of the topsApp PGE, it checks whether the expected output already exists; if so, it exits immediately. In this use case, the PGE would emit an sdswatch metric so that dataset existence and short-circuiting are reported.
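A rough sketch of how such a check might report itself is shown below. The metric keys ("dataset_exists", "short_circuit", "step") and the helper names are assumptions made for illustration; the actual topsApp PGE may structure this check differently. The `emit` argument stands in for whatever logging call is used, for example a logger's log method.

```python
import tempfile
from pathlib import Path


def run_with_short_circuit(expected_output, emit, process):
    """Skip processing when the expected output exists, reporting either way."""
    if Path(expected_output).exists():
        emit("dataset_exists", "true")
        emit("short_circuit", "true")
        return False  # nothing was processed
    emit("dataset_exists", "false")
    process()
    emit("step", "processing_done")
    return True


# Demo with an in-memory recorder instead of a real log file.
emitted = []


def record(key, value):
    emitted.append((key, value))


workdir = Path(tempfile.mkdtemp())
existing_product = workdir / "existing_product"
existing_product.mkdir()

ran_existing = run_with_short_circuit(existing_product, record, lambda: None)
ran_missing = run_with_short_circuit(workdir / "missing_product", record, lambda: None)
print(emitted)
```

Counting the "short_circuit" metric over time would then show how often the 65-minute processing was avoided.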

Use case 5: metrics analysis via Kibana

Enable the visualization of key/value metrics, first in aggregate across all workers and components. This enables viewing statistics of key/values such as the min/mean/max values of a reported metric. Then facet onto one compute node or worker to see metrics just for that worker. Then facet onto one single metric to see its value reported over time.

Dashboard panels to support faceting:

keys over time
values over time
table distribution of keys
table distribution of values
table distribution of IP addresses
table distribution of component IDs
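The aggregate-then-facet workflow above can be illustrated outside Kibana with a few lines of Python. This is only a sketch of the idea on made-up records; Kibana performs the equivalent aggregations in Elasticsearch, and the hosts, component IDs, and values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (host, component ID, key, numeric value)
records = [
    ("192.168.1.250", "worker-a", "ready", 1.0),
    ("192.168.1.250", "worker-a", "ready", 3.0),
    ("192.168.1.251", "worker-b", "ready", 51.0),
    ("192.168.1.251", "worker-b", "unacked", 131.0),
]


def summarize(rows):
    """Return min/mean/max of the reported values for each key."""
    values_by_key = defaultdict(list)
    for _host, _component, key, value in rows:
        values_by_key[key].append(value)
    return {
        key: {"min": min(values), "mean": mean(values), "max": max(values)}
        for key, values in values_by_key.items()
    }


def facet(rows, host=None, key=None):
    """Keep only the rows matching the selected host and/or key."""
    return [
        row for row in rows
        if (host is None or row[0] == host) and (key is None or row[2] == key)
    ]


overall = summarize(records)                               # all workers
one_host = summarize(facet(records, host="192.168.1.250"))  # one compute node
print(overall)
print(one_host)
```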

Use case 6: metrics collection for verdi events e.g. hariki

The verdi job worker has many states that can be reported to sdswatch for analysis in real time, e.g. hariki, job states, etc. verdi could update sdswatch logs to enable insights into job worker events.

How to use SDSWatch as a client?

To gain insight through Kibana, you first need to emit logs in the format, with the file name, and in the directory that SDSWatch requires.

Generic SDSWatch log

a. These logs typically will be in the log dir managed by supervisord: /home/ops/verdi/log/<name>.sdswatch.log

b. Name your sdswatch log <name>.sdswatch.log

c. Format: <timestamp iso 8601>, <host>, <source_type>, <source_id>, <key>, <value>

2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , state, running
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , ready, 1 
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , unacked, 0 
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , state, running 
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , ready, 51 
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , unacked, 131

d. The format is inspired by the commercial Splunk metrics approach. See https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview

e. This format is used for logs from hysds core components (e.g. verdi, mozart, grq, etc.)
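For components that cannot use the Python package described later, a generic line can be assembled by hand. The sketch below is one possible way to do it, not the package's implementation; the function name is hypothetical, and the timestamp style follows the sample lines above.

```python
from datetime import datetime, timezone


def format_generic_line(host, source_type, source_id, key, value, when=None):
    """Build: <timestamp iso 8601>, <host>, <source_type>, <source_id>, <key>, <value>."""
    when = when or datetime.now(timezone.utc)
    timestamp = when.strftime("%Y-%m-%dT%H:%M:%S%z")
    fields = [timestamp, host, source_type, source_id, key, str(value)]
    return ", ".join(fields)


sample_line = format_generic_line(
    "http://e-jobs.aria.hysds.io:15672",
    "rabbitmq",
    "spyddder-sling-extract-asf",
    "ready",
    1,
    when=datetime(2020, 4, 9, 1, 38, 22, tzinfo=timezone.utc),
)
print(sample_line)
```

Appending such lines to /home/ops/verdi/log/<name>.sdswatch.log is then an ordinary file write in append mode.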

PGE SDSWatch log

a. These logs typically will be on the verdi job worker: /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log

b. Name your sdswatch log <source_type>.pge.sdswatch.log

c. Format: <timestamp ISO 8601>, <key>, <value>

Note: Use double quotes to allow a comma within a token value.
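The quoting rule in the note is what Python's standard csv module does by default, so a hand-rolled PGE line can lean on it. The sketch below assumes fields are separated by a bare comma with no following space, as in the sample PGE log shown later on this page; the function name is hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def format_pge_line(key, value, when=None):
    """Build: <timestamp ISO 8601>,<key>,<value>, quoting fields that contain commas."""
    when = when or datetime.now(timezone.utc)
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
    writer.writerow([when.isoformat(timespec="seconds"), key, value])
    return buffer.getvalue().rstrip("\r\n")


fixed_time = datetime(2020, 5, 25, 1, 52, 40, tzinfo=timezone.utc)
plain_line = format_pge_line("step", "download", when=fixed_time)
quoted_line = format_pge_line("message", "retrying, attempt 2", when=fixed_time)

# Reading the line back with csv recovers the comma-containing value intact.
parsed_back = next(csv.reader([quoted_line]))
print(plain_line)
print(quoted_line)
```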

How to emit SDSWatch logs in Python with the SDSWatchLogger package? (optional)

Download and install hysds-sdswatch via pip:

pip3 install git+https://github.com/hysds/hysds-sdswatch.git@master

GitHub repo: https://github.com/hysds/hysds-sdswatch

Importing SDSWatchLogger into your program

# For generic type
from sdswatch.logger import SDSWatchLogger

# For PGE type
from sdswatch.pgelogger import PGESDSWatchLogger

Methods that SDSWatchLogger provides

For generic type:

from sdswatch.logger import SDSWatchLogger

# you can only instantiate once
logger = SDSWatchLogger(file_dir="/path/to/dir", 
                        name="logname", 
                        source_type="source_type", 
                        source_id="source_id")

# to use the logger in other modules after the first instantiation
# logger = SDSWatchLogger.get_logger()

# to log 
logger.log(metric_key="key",
           metric_value="value")

For pge type:

from sdswatch.pgelogger import PGESDSWatchLogger

# you can only instantiate once
logger = PGESDSWatchLogger(file_dir="/path/to/dir", 
                           name="job_type")

# to use the logger in other modules after the first instantiation
# logger = PGESDSWatchLogger.get_logger()

# to log 
logger.log(metric_key="key",
           metric_value="value")

Sample Python code for PGE type log

Note: the Logger will automatically create a new log file when instantiated.

example_main_module.py

from sdswatch.pgelogger import PGESDSWatchLogger
from example_auxiliary_module import download

# instantiate once; other modules retrieve this logger with get_logger()
sdsw_logger = PGESDSWatchLogger("/path/to/job/dir", "example_hello_world")

if __name__ == "__main__":
    sdsw_logger.log("step", "pre-download")
    download()
    sdsw_logger.log("step", "post-download")

example_auxiliary_module.py

from sdswatch.pgelogger import PGESDSWatchLogger

def download():
    # reuse the logger instantiated in example_main_module.py
    sdsw_logger = PGESDSWatchLogger.get_logger()
    sdsw_logger.log("step", "download")

example_hello_world.pge.sdswatch.log

'2020-05-25 01:52:40.569',step,pre-download
'2020-05-25 01:52:40.570',step,download
'2020-05-25 01:52:40.570',step,post-download
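Once a PGE log like the one above exists, the time spent between steps can be recovered from the timestamps. The following sketch assumes the exact layout of the sample (single-quoted timestamp with milliseconds, then key and value); it is an illustration of what the data allows, not part of the SDSWatchLogger package.

```python
import csv
from datetime import datetime

sample_log = """'2020-05-25 01:52:40.569',step,pre-download
'2020-05-25 01:52:40.570',step,download
'2020-05-25 01:52:40.570',step,post-download
"""


def read_steps(text):
    """Parse each line into (timestamp, key, value); timestamps are single-quoted."""
    rows = csv.reader(text.splitlines(), quotechar="'")
    return [
        (datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S.%f"), row[1], row[2])
        for row in rows
        if row  # skip blank lines
    ]


def elapsed_between_steps(steps):
    """Return (from_step, to_step, seconds) for each consecutive pair of entries."""
    return [
        (previous[2], current[2], (current[0] - previous[0]).total_seconds())
        for previous, current in zip(steps, steps[1:])
    ]


durations = elapsed_between_steps(read_steps(sample_log))
for from_step, to_step, seconds in durations:
    print(f"{from_step} -> {to_step}: {seconds:.3f} s")
```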

Guide for future development of SDSWatch

Reference Design for SDSWatch (Elastic stack)

On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side.
On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database. Kibana is used for visualization.

Current SDSWatch Design with regard to one compute node

Important note: On the client side, Logstash is currently used instead of Filebeat, and it needs to be replaced in the future. Logstash is similar to Filebeat and provides more log processing capability, but Filebeat is much lighter weight and more suitable for the client side. The plan is to remove Logstash from the client side and move the current client-side Logstash configuration into the Logstash configuration on the server side.

Client side (on compute node): Logstash, with the appropriate configuration, listens to sdswatch log files at two locations:

/data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<job_id>/<job_type>.pge.sdswatch.log for pge logs.
/home/ops/verdi/log/<name>.sdswatch.log for any other sdswatch logs (currently only job worker) on the compute node

Logstash will process these logs and ship them to the SDSWatch server.
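To check which files on a node would fall under these two locations, the paths can be expressed as glob patterns. The patterns below are an assumption derived from the two paths listed above, not a copy of the actual Logstash configuration, and they are written relative to a root directory so the sketch can run against a temporary tree.

```python
import tempfile
from pathlib import Path

# <year>/<month>/<day>/<hour>/<minute>/<job_id>/<job_type>.pge.sdswatch.log
PGE_PATTERN = "data/work/jobs/*/*/*/*/*/*/*.pge.sdswatch.log"
# <name>.sdswatch.log in the supervisord log directory
GENERIC_PATTERN = "home/ops/verdi/log/*.sdswatch.log"


def find_sdswatch_logs(root):
    """Return (pge_logs, generic_logs) found under the given root directory."""
    root = Path(root)
    return sorted(root.glob(PGE_PATTERN)), sorted(root.glob(GENERIC_PATTERN))


# Demo: build a throwaway directory tree containing one log of each kind.
demo_root = Path(tempfile.mkdtemp())
pge_log = (
    demo_root
    / "data/work/jobs/2020/05/25/01/52/example_job_id"
    / "example_hello_world.pge.sdswatch.log"
)
generic_log = demo_root / "home/ops/verdi/log/example_worker.sdswatch.log"
for path in (pge_log, generic_log):
    path.parent.mkdir(parents=True)
    path.touch()

pge_logs, generic_logs = find_sdswatch_logs(demo_root)
print(pge_logs)
print(generic_logs)
```

On a real compute node the root would be "/".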

Server side: Redis, Logstash, Elasticsearch, Kibana

Redis is leveraged as the broker transport for delivery to Elasticsearch. In the past, using Elasticsearch alone didn’t scale well when there was a large volume of logs coming in, and Redis is used to solve that problem.

Files and directories that are relevant to SDSWatch and how to run them

I first developed SDSWatch on hysdsops@100.67.35.12 for the server side and hysdsops@100.67.32.182 for the client side, so the files may still be there.

On hysdsops@100.67.32.182 for the client side, here is my work:

/export/home/hysdsops/verdi/share/sdswatch-client
/export/home/hysdsops/verdi/etc/logstash.conf
/export/home/hysdsops/verdi/etc/filebeat.yml (not in the current SDSWatch-client implementation, but you may find it helpful when migrating Logstash to Filebeat on the client side)
/export/home/hysdsops/verdi/etc/filebeat-configs (not in the current SDSWatch-client implementation, but you may find it helpful when migrating Logstash to Filebeat on the client side)

On hysdsops@100.67.35.12 for the server side, here is my work:

/export/home/hysdsops/tmp/sdswatch-server
Note: in /export/home/hysdsops/tmp/sdswatch-server, there is a file .env with content
```
KIB=5601
REDIS=6379
```

To run SDSWatch client on hysdsops@100.67.32.182,

supervisor.d is already set up to run it automatically.
See the section “To check if the sdswatch-client is running correctly” below.

To run SDSWatch server on hysdsops@100.67.35.12,

Run “cd /export/home/hysdsops/tmp/sdswatch-server”, then run “bash sdswatch-server.sh”.
Note: I used “chmod 777 -R data” for the data directory inside “sdswatch-server” since I couldn’t find a way to make it work without using 777 mode 😞
After running the 2 command lines above, here is the link to open the Kibana dashboard: https://100.67.35.12:1502/app/kibana#/home?_g=()

Highly recommended: log in to both servers, play around with the current SDSWatch, and read all the code, since there is not much of it. I recommend reading the corresponding files at https://github.com/hysds/hysds-sdswatch because there is more documentation there. (Tips: if you are reading the client side, find sdswatch-client.sh and read it first. If you are reading the server side, find sdswatch-server.sh and read it first. Ignore all files related to Filebeat for now.)

In case all the files are lost, here is how to set it up.

How to set up the server side

Go to https://github.com/hysds/hysds-sdswatch, open the onserver directory, and download all the files in it.
Log in to the server hysdsops@100.67.35.12.
Copy the sdswatch-server directory from your machine into /export/home/hysdsops/tmp/ on the server. Also, in /export/home/hysdsops/tmp/sdswatch-server/, create an empty data directory to store Elasticsearch data.
To start the SDSWatch server, on the server’s terminal:
- cd /export/home/hysdsops/tmp/sdswatch-server
- give the container read and write permission on the /export/home/hysdsops/tmp/sdswatch-server/data directory (“chmod 777 -R data” works, but it should not be used for security reasons. I couldn’t think of a way to make it work without “chmod 777 -R data”)
- bash sdswatch-server.sh
You can access Kibana through localhost:5601 on the server.

How to set up the client side

Go to https://github.com/hysds/hysds-sdswatch, open the onclient directory, and download all the files in it.
Log in to the server hysdsops@100.67.32.182.
Copy the sdswatch-client directory from your machine into /export/home/hysdsops/verdi/share/ on the server. Also, in /export/home/hysdsops/verdi/share/sdswatch-client/, create an empty data directory to store the Logstash history. (Don’t delete this directory when there are a lot of old sdswatch logs on your system. If you delete it, Logstash will resend the old logs.)
Copy the logstash.conf file into /export/home/hysdsops/verdi/etc/
Notice that there are a few Filebeat files and directories. Don’t touch them for now. I’ll mention them later.

How to add sdswatch-client to supervisor.d so that supervisor.d automatically starts Logstash when the server boots up again

Note for hysdsops@100.67.32.182: I already did this, but you may find it helpful to read through it.

supervisor.d

[program:sdswatch-client]
directory=/export/home/hysdsops/verdi/share/sdswatch-client/
command=/export/home/hysdsops/verdi/share/sdswatch-client/sdswatch-client.sh
process_name=%(program_name)s-%(process_num)02d
priority=1
numprocs=1
numprocs_start=0
redirect_stderr=true
stdout_logfile=%(here)s/../log/%(program_name)s-%(process_num)02d.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
startsecs=10

To check whether sdswatch-client is running, run these command lines on the client side:

supervisorctl status

supervisorctl reread

supervisorctl update

If you did it correctly, running “supervisorctl status” again should give you the following:

(verdi) hysdsops@ip-100-67-32-182:~/verdi/etc$ supervisorctl status

sdswatch-client:sdswatch-client-00          RUNNING   pid 17993, uptime 0:08:23

If it’s not running, you can run “supervisorctl start sdswatch-client:sdswatch-client-00”.

To check if the sdswatch-client is running correctly

On the client server, run tail -f /export/home/hysdsops/verdi/log/sdswatch-client-00.log (If you look at the supervisor.d script above, you’ll see the line “stdout_logfile=%(here)s/../log/%(program_name)s-%(process_num)02d.log”. This is where Logstash directs its messages.)
If the server hasn’t been started yet or there are no logs to ship, you should see 2 errors (this is ok):
- Unable to retrieve license information from license server {:message=>"No Available connections"}
- Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://elasticsearch:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://elasticsearch:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
If there are logs to ship, you should see a log similar to this:

{
  "source_id" => "user_rules_job",
  "host" => "https://100.67.33.56:15673",
  "metric_key" => "state",
  "log_path" => "/verdi/rabbitmq_queue_monitor_to_sdswatch-00.sdswatch.log",
  "sdswatch_timestamp" => 2020-05-29T02:30:18.000Z,
  "metric_value_float" => -1,
  "@version" => "1",
  "metric_value_string" => "running",
  "message" => "2020-05-29T02:30:18+00:00 , https://100.67.33.56:15673 , rabbitmq.queue , user_rules_job , state, running",
  "source_type" => "rabbitmq.queue"
}
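To make the relationship between the raw "message" and the parsed fields concrete, the sketch below splits a generic sdswatch line into the same field names. It is a Python approximation written from the sample event only, not the actual Logstash filter. In particular, using -1 for "metric_value_float" when the value is not numeric is an assumption inferred from that sample, and values containing quoted commas are not handled.

```python
def parse_generic_line(line):
    """Split '<timestamp>, <host>, <source_type>, <source_id>, <key>, <value>'."""
    parts = [part.strip() for part in line.split(",", 5)]
    if len(parts) != 6:
        raise ValueError(f"expected 6 comma-separated fields, got {len(parts)}")
    timestamp, host, source_type, source_id, key, value = parts
    try:
        value_float = float(value)
    except ValueError:
        value_float = -1.0  # assumption: sentinel for non-numeric values
    return {
        "sdswatch_timestamp": timestamp,
        "host": host,
        "source_type": source_type,
        "source_id": source_id,
        "metric_key": key,
        "metric_value_string": value,
        "metric_value_float": value_float,
    }


message = (
    "2020-05-29T02:30:18+00:00 , https://100.67.33.56:15673 , "
    "rabbitmq.queue , user_rules_job , state, running"
)
event = parse_generic_line(message)
numeric_event = parse_generic_line(message.replace("state, running", "ready, 51"))
print(event)
```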

Next Steps

Replace Logstash on the client side with Filebeat.
Move the Logstash filtering code to the server side.
Test.
On the server side, find a way to give the docker container read and write permission on /export/home/hysdsops/tmp/sdswatch-server/data without using “chmod 777”.
Install SDSWatch client on all compute nodes and test at scale.

Tips for migrating Logstash to Filebeat on client side

Understand how the client and server currently work first by playing around with it.
I recommend reading all the relevant files, since there are not many (you should ignore all the Filebeat files on the client side for now). Tip: start with sdswatch-server.sh and sdswatch-client.sh. Try to understand the Logstash configuration file on the client side.
When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just add a filter block between the input block and the output block).
I recommend trying out Filebeat on your local machine with SDSWatch logs first. Try using Filebeat to scoop up sdswatch logs and send them to the console. Then look at the output logs printed in the console and investigate the fields inside them. Then compare them with the input assumed by the current Logstash configuration in SDSWatch-client. Try playing around with Filebeat’s “add_fields” feature and see if you can make the output logs from Filebeat carry the required information.
When you have figured out how to make the output logs from Filebeat look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system and on GitHub. Use these files as a starting point. (Remember to always enable the “live reloading” feature so we can update the configuration during production.)
I also already wrote a configuration file to create a docker container with Filebeat. You may want to log into hysdsops@100.67.32.182 and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is similar to sdswatch-client, but it’s for Filebeat. However, when I ran it, there was an error whose cause I couldn’t figure out. This problem prevented me from migrating Logstash to Filebeat during my internship. Error when running Filebeat in docker: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked about it on the Elastic forum; you may find that discussion helpful.)
The Filebeat-related files haven’t been tested yet, so take them with a grain of salt.

Resources you may find helpful

Logstash configuration: https://www.elastic.co/guide/en/logstash/current/configuration.html
Materials relevant to environment variables in Logstash and Filebeat may be helpful.
Logstash directory layout on docker container: https://www.elastic.co/guide/en/logstash/current/dir-layout.html (you can find similar websites for Elasticsearch, Filebeat and Kibana)
Logstash file input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html
Logstash redis input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-redis.html
Logstash filter: https://www.elastic.co/guide/en/logstash/current/filter-plugins.html
Basic filebeat configuration: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration.html
Don’t fall into these pitfalls when using Filebeat: https://logz.io/blog/filebeat-pitfalls/
Filebeat processor: https://www.elastic.co/guide/en/beats/filebeat/current/filtering-and-enhancing-data.html (similar to a Logstash filter). This can be very helpful for adjusting Filebeat output logs before sending them to Logstash on the server side.
Filebeat log input: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html
Docker materials relevant to volume mounts, environment variables, and running command lines inside a docker container may be helpful.
Materials related to giving read and write permission to a docker container are very helpful.
Materials related to saving and loading docker images are helpful when the compute node (e.g. verdi, factotum) doesn’t have access to the internet.