SDSWatch
SDSWatch is designed to give real-time insight into what compute nodes are doing. It does so through Kibana dashboards and visualizations, using Kibana’s aggregation and faceting capabilities. Please click on this link if you’re interested in seeing what the Kibana dashboard for SDSWatch looks like.
Content:
What are potential use cases of SDSWatch ?
How to use SDSWatch as a client ?
How to emit SDSWatch logs in Python with SDSWatchLogger package ? (optional)
Guide for future development of SDSWatch.
What are potential use cases of SDSWatch ?
Use case 1: metrics collection from inside PGE processing steps
The basic use case is for arbitrary PGEs to output essential key/value log metrics for sdswatch to scoop up for analytics. With a basic schema of csv-style log files output by PGEs, any component can output metrics that are streamed back for analytics in the cloud, without the component needing to know anything about the cloud. PGEs running in a fleet of ASGs can each output key/values to csv files on their own workers, and sdswatch streams the metrics back to the ELK stack for analysis.
This approach enables even legacy Fortran algorithms to emit metrics that can be scooped up by sdswatch and streamed to aggregation for analytics.
An example sdswatch key/value may also include:
2020-03-03 10:26:00, "192.168.1.250", "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121", "step", "coregistration"
2020-03-03 10:40:00, "192.168.1.250", "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121", "step", "phase unwrapping"
Example job to instrument with sdswatch: job-acquisition_ingest-scihub:release-20190710
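A minimal sketch of how a PGE (or a thin wrapper around a legacy binary) could append lines in the same shape as the example lines above. The function name emit_step and the temporary log location are hypothetical and chosen only for illustration. The code only writes to a local file and knows nothing about the cloud.

```python
import os
import tempfile
from datetime import datetime


def emit_step(log_path, host, job_type, step, now=None):
    """Append one csv-style "step" metric line to a local log file."""
    now = now or datetime.now()
    line = '{}, "{}", "{}", "step", "{}"\n'.format(
        now.strftime("%Y-%m-%d %H:%M:%S"), host, job_type, step)
    # Append-only, so a log shipper tailing the file sees each new line once.
    with open(log_path, "a") as f:
        f.write(line)
    return line


# Usage: mark each processing step as it starts.
# The log location below is a throwaway directory used only for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "example.sdswatch.log")
job = "hysds-io-standard_product-s1gunw-topsapp:release-sp-20191121"
emit_step(log_path, "192.168.1.250", job, "coregistration",
          datetime(2020, 3, 3, 10, 26, 0))
emit_step(log_path, "192.168.1.250", job, "phase unwrapping",
          datetime(2020, 3, 3, 10, 40, 0))
```

A legacy Fortran program would not need to change at all under this approach if a small wrapper script writes the lines before and after invoking it.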
Use case 2: metrics collection from system crons
For some adaptations of hysds-core, periodic submission of jobs is done by lambdas and crons. Crons usually leave log files that are invisible to Metrics. By simply allowing these old cron scripts to emit sdswatch csv logs, sdswatch should be able to monitor for line updates and ship them to Elasticsearch for analysis.
An important use case here is the ability to capture cron script warnings or failures. Without sdswatch, errors would not be known unless someone logged into the system.
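As an illustration of capturing cron failures, here is a minimal sketch of a wrapper that reports the outcome of a task as key/value pairs. The names run_and_report and emit are hypothetical. In a real cron script, emit would append a line to an sdswatch log file; here it only collects records in a list so the example is self-contained.

```python
def run_and_report(task, emit):
    """Run task() and report its outcome through emit(key, value)."""
    emit("status", "started")
    try:
        task()
    except Exception as exc:
        # Report the failure, then re-raise so cron still sees a non-zero exit.
        emit("status", "failed")
        emit("error", "{}: {}".format(type(exc).__name__, exc))
        raise
    emit("status", "succeeded")


# Usage with an in-memory stand-in for the sdswatch log.
records = []


def emit(key, value):
    records.append((key, value))


def failing_task():
    raise RuntimeError("disk full")


try:
    run_and_report(failing_task, emit)
except RuntimeError:
    pass
```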
Use case 3: metrics collection in critical hysds components
Similar to use case 2, but essential for monitoring key hysds core components. For example:
mozart
orchestrator: how jobs are being routed into the Mozart queues.
process_events: how component events are streamed back from workers to mozart/ES for figaro view.
factotum
workers in factotum: tracking states, errors, etc.
Use case 4: metrics collection on non-error job states e.g. topsApp short-circuiting existing datasets
job-standard_product-s1gunw-topsapp:release-sp-20191121 currently takes 65 minutes on average to process one S1-GUNW data product. At the beginning of the topsApp PGE, it checks whether the expected output already exists; if so, it exits immediately. In this use case, sdswatch metrics for dataset existence and short-circuiting would be emitted so that they can be reported.
Use case 5: metrics analysis via Kibana
Enable the visualization of key/value metrics, first in aggregate across all workers and components. This enables viewing statistics of key/values such as the min/mean/max values of a reported metric. Then facet into one compute node and into a worker to see metrics just for that worker. Then facet onto one single metric to see its value reported over time.
Dashboard panels to support faceting:
keys over time
values over time
table distribution of keys
table distribution of values
table distribution of IP addresses
table distribution of component IDs
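The aggregate-then-facet workflow described above can be sketched in plain Python. This is only an illustration of the idea, not how Kibana or Elasticsearch implement it. The record layout (host, key, value) and the function names summarize and facet are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Return {key: (min, mean, max)} over the numeric values of each key."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec["key"]].append(float(rec["value"]))
    return {k: (min(v), mean(v), max(v)) for k, v in by_key.items()}


def facet(records, **fields):
    """Keep only records whose fields match, e.g. facet(records, host="10.0.0.1")."""
    return [r for r in records
            if all(r.get(name) == wanted for name, wanted in fields.items())]


# Hypothetical sample records from two compute nodes.
records = [
    {"host": "10.0.0.1", "key": "ready", "value": 1},
    {"host": "10.0.0.1", "key": "ready", "value": 5},
    {"host": "10.0.0.2", "key": "ready", "value": 51},
    {"host": "10.0.0.2", "key": "unacked", "value": 131},
]

overall = summarize(records)                            # all nodes
one_node = summarize(facet(records, host="10.0.0.1"))   # one node only
```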
Use case 6: metrics collection for verdi events e.g. hariki
The Verdi job worker has many states that can be reported to sdswatch for analysis in real-time, e.g. hariki, job states, etc. Verdi could update sdswatch logs to enable insights into job worker events.
How to use SDSWatch as a client ?
To gain insight through Kibana, you first need to emit logs in the format, with the name, and in the directory that SDSWatch requires.
Generic SDSWatch log
a. These logs typically will be in the log dir managed by supervisord: /home/ops/verdi/log/<name>.sdswatch.log
b. Name your sdswatch log <name>.sdswatch.log
c. Format: <timestamp iso 8601>, <host>, <source_type>, <source_id>, <key>, <value>
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , state, running
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , ready, 1
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , spyddder-sling-extract-asf , unacked, 0
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , state, running
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , ready, 51
2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , standard_product-s1gunw-topsapp-pleiades , unacked, 131
d. The format is inspired by the metrics approach of the commercial product Splunk. See https://docs.splunk.com/Documentation/Splunk/8.0.3/Metrics/Overview
e. This will be used for hysds core components (e.g. verdi, mozart, grq, etc.) logs
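To make the six-field generic format concrete, here is a minimal parsing sketch. It is not the parser SDSWatch itself uses (that logic lives in the Logstash configuration). The function name parse_generic_line and the FIELDS tuple are hypothetical.

```python
import csv
from datetime import datetime

FIELDS = ("timestamp", "host", "source_type", "source_id", "key", "value")


def parse_generic_line(line):
    """Split one generic sdswatch line into a dict keyed by field name."""
    # csv.reader honours double-quoted tokens; strip() removes the padding
    # around the commas seen in the sample lines.
    tokens = [t.strip() for t in next(csv.reader([line], skipinitialspace=True))]
    if len(tokens) != len(FIELDS):
        raise ValueError("expected {} fields, got {}".format(len(FIELDS), len(tokens)))
    record = dict(zip(FIELDS, tokens))
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S%z")
    return record


record = parse_generic_line(
    "2020-04-09T01:38:22+0000 , http://e-jobs.aria.hysds.io:15672 , rabbitmq , "
    "spyddder-sling-extract-asf , ready, 1"
)
```

Running a few of your own log lines through a check like this before deploying can catch a missing field early.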
PGE SDSWatch log
a. These logs typically will be on the verdi job worker: /data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<source_id>/<source_type>.pge.sdswatch.log
b. Name your sdswatch log <source_type>.pge.sdswatch.log
c. Format: <timestamp ISO 8601>, <key>, <value>
Note: Use double quotes to allow a comma within a token value.
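The quoting rule can be handled by Python's standard csv module instead of by hand. The sketch below builds one PGE-style line. The function name pge_line is hypothetical, and the output has no space after the commas, which is an assumption about what the consumer accepts.

```python
import csv
import io
from datetime import datetime, timezone


def pge_line(key, value, now=None):
    """Return one "<timestamp>,<key>,<value>" line, quoting tokens only when needed."""
    now = now or datetime.now(timezone.utc)
    buf = io.StringIO()
    # The default QUOTE_MINIMAL behaviour wraps a token in double quotes only
    # if it contains a comma, a double quote, or a line break.
    csv.writer(buf).writerow([now.strftime("%Y-%m-%dT%H:%M:%S%z"), key, value])
    return buf.getvalue().rstrip("\r\n")


line = pge_line("message", "found 3 scenes, skipped 1",
                datetime(2020, 4, 9, 1, 38, 22, tzinfo=timezone.utc))
# Reading the line back shows the comma survived inside a single token.
tokens = next(csv.reader([line]))
```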
How to emit SDSWatch logs in Python with SDSWatchLogger package? (optional)
Download and install hysds-sdswatch via pip:
pip3 install git+https://github.com/hysds/hysds-sdswatch.git@master
Github repo: https://github.com/hysds/hysds-sdswatch
Importing SDSWatchLogger into your program
# For generic type
from sdswatch.logger import SDSWatchLogger
# For PGE type
from sdswatch.pgelogger import PGESDSWatchLogger
Methods that SDSWatchLogger provides
For generic type:
from sdswatch.logger import SDSWatchLogger
# you can only instantiate once
logger = SDSWatchLogger(file_dir="/path/to/dir", name="logname", source_type="source_type", source_id="source_id")
# to use the logger in other modules after the first instantiation
# logger = SDSWatchLogger.get_logger()
# to log
logger.log(metric_key="key", metric_value="value")
For pge type:
from sdswatch.pgelogger import PGESDSWatchLogger
# you can only instantiate once
logger = PGESDSWatchLogger(file_dir="/path/to/dir", name="job_type")
# to use the logger in other modules after the first instantiation
# logger = PGESDSWatchLogger.get_logger()
# to log
logger.log(metric_key="key", metric_value="value")
Sample Python code for PGE type log
Note: the Logger will automatically create a new log file when instantiated.
example_main_module.py
from sdswatch.pgelogger import PGESDSWatchLogger
from example_auxiliary_module import download

sdsw_logger = PGESDSWatchLogger("/path/to/job/dir", "example_hello_world")

if __name__ == "__main__":
    sdsw_logger.log("step", "pre-download")
    download()
    sdsw_logger.log("step", "post-download")
example_auxiliary_module.py
from sdswatch.pgelogger import PGESDSWatchLogger

def download():
    sdsw_logger = PGESDSWatchLogger.get_logger()
    sdsw_logger.log("step", "download")
example_hello_world.pge.sdswatch.log
'2020-05-25 01:52:40.569',step,pre-download
'2020-05-25 01:52:40.570',step,download
'2020-05-25 01:52:40.570',step,post-download
Guide for future development of SDSWatch
Reference Design for SDSWatch (Elastic stack)
On the client side: Filebeat is installed across multiple compute nodes, shipping data to the server side.
On the server side: Logstash receives data from Filebeat and ships it to the Elasticsearch database. Kibana is used for visualization.
Current SDSWatch Design with regard to one compute node
Important Note: On the client side, Logstash is currently used instead of Filebeat, and it needs to be replaced in the future. Logstash is similar to Filebeat and provides more log processing capability, but Filebeat is much lighter weight and more suitable for the client side. The plan is to remove Logstash from the client side and move the current client-side Logstash configuration into the Logstash configuration on the server side.
Client side (on compute node): Logstash with appropriate configuration will listen to sdswatch log files at two locations:
/data/work/jobs/<year>/<month>/<day>/<hour>/<minute>/<job_id>/<job_type>.pge.sdswatch.log for pge logs.
/home/ops/verdi/log/<name>.sdswatch.log for any other sdswatch logs (currently only job worker) on the compute node
Logstash will process these logs and ship them to the SDSWatch server.
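To check which of the two watched locations a given file falls under, the paths can be expressed as glob patterns. The patterns below are my own rendering of the two paths listed above, and the actual patterns in logstash.conf may differ. The function name classify is hypothetical.

```python
from pathlib import PurePosixPath

# One "*" per path component: year, month, day, hour, minute, job id, file name.
PGE_PATTERN = "/data/work/jobs/*/*/*/*/*/*/*.pge.sdswatch.log"
GENERIC_PATTERN = "/home/ops/verdi/log/*.sdswatch.log"


def classify(path):
    """Return "pge", "generic", or None for a candidate log file path."""
    p = PurePosixPath(path)
    # With an absolute pattern, PurePath.match requires the whole path to match,
    # and "*" never crosses a "/" boundary.
    if p.match(PGE_PATTERN):
        return "pge"
    if p.match(GENERIC_PATTERN):
        return "generic"
    return None


kind = classify(
    "/data/work/jobs/2020/05/25/01/52/job-123/example_hello_world.pge.sdswatch.log"
)
```

A file that does not match either pattern (wrong directory depth, wrong suffix) would simply not be picked up, which is a common reason for logs not showing up in Kibana.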
Server side: Redis, Logstash, Elasticsearch, Kibana
Redis is leveraged as the broker transport for delivery to Elasticsearch. In the past, using Elasticsearch alone didn’t scale well when there was a large volume of logs coming in, and Redis is used to solve that problem.
Files and directories that are relevant to SDSWatch and how to run them
I first developed SDSWatch on hysdsops@100.67.35.12 for the server side and hysdsops@100.67.32.182 for the client side, so the files may still be there.
On hysdsops@100.67.32.182 for the client side, here is my work:
/export/home/hysdsops/verdi/share/sdswatch-client
/export/home/hysdsops/verdi/etc/logstash.conf
/export/home/hysdsops/verdi/etc/filebeat.yml (not in the current SDSWatch-client implementation, but you may find it helpful when migrating Logstash to Filebeat on the client side)
/export/home/hysdsops/verdi/etc/filebeat-configs (not in the current SDSWatch-client implementation, but you may find it helpful when migrating Logstash to Filebeat on the client side)
On hysdsops@100.67.35.12 for the server side, here is my work:
/export/home/hysdsops/tmp/sdswatch-server
Note: in /export/home/hysdsops/tmp/sdswatch-server, there is a file .env with content
KIB=5601
REDIS=6379
To run SDSWatch client on hysdsops@100.67.32.182,
supervisor.d is already set up to run it automatically
see the section “To check if the sdswatch-client is running correctly” below.
To run SDSWatch server on hysdsops@100.67.35.12,
run “cd /export/home/hysdsops/tmp/sdswatch-server”, then run “bash sdswatch-server.sh”
Note: I used “chmod 777 -R data” for the data directory inside “sdswatch-server” since I couldn’t find a way to make it work without using 777 mode 😞
After running the 2 commands above, here is the link to open the Kibana dashboard: https://100.67.35.12:1502/app/kibana#/home?_g=()
Highly recommended: log in to both servers, play around with the current SDSWatch, and read all the code, since there is not much of it. I recommend finding the corresponding files at https://github.com/hysds/hysds-sdswatch because there is more documentation there. (Tips: if you are reading the client side, find sdswatch-client.sh and read it first. If you are reading the server side, find sdswatch-server.sh and read it first. Ignore all Filebeat-related files for now.)
In case all the files are lost, here is how to set it up.
How to set up the server side
Go to https://github.com/hysds/hysds-sdswatch, open onserver directory and download all the files in it.
Log in to server hysdsops@100.67.35.12
Copy the sdswatch-server directory from your machine and put it in /export/home/hysdsops/tmp/ in the server. Also, in the /export/home/hysdsops/tmp/sdswatch-server/, create an empty data directory to save Elasticsearch data.
To start SDSWatch server, on the server’s terminal
cd /export/home/hysdsops/tmp/sdswatch-server
give the container read and write permission on the /export/home/hysdsops/tmp/sdswatch-server/data directory (“chmod 777 -R data” works, but it should not be used for security reasons; I couldn’t think of a way to make it work without “chmod 777 -R data”)
bash sdswatch-server.sh
You can access Kibana through localhost:5601 of the server.
How to set up the client side
Go to https://github.com/hysds/hysds-sdswatch, open onclient directory and download all the files in it.
Log in to server hysdsops@100.67.32.182
Copy the sdswatch-client directory from your machine and put it in /export/home/hysdsops/verdi/share/ in the server. Also, in the /export/home/hysdsops/verdi/share/sdswatch-client/, create an empty data directory to save Logstash history (Don’t delete this directory when there are a lot of old sdswatch logs on your system. If you delete this directory, Logstash will resend old logs).
Copy the logstash.conf file and put it in /export/home/hysdsops/verdi/etc/
Notice that there are a few Filebeat files and directories. Don’t touch them for now. I’ll mention them later.
How to add sdswatch-client to supervisor.d so that supervisor.d automatically starts Logstash when the server boots up again.
Note for hysdsops@100.67.32.182: I already did it, but you may find it helpful to read it again.
supervisor.d
[program:sdswatch-client]
directory=/export/home/hysdsops/verdi/share/sdswatch-client/
command=/export/home/hysdsops/verdi/share/sdswatch-client/sdswatch-client.sh
process_name=%(program_name)s-%(process_num)02d
priority=1
numprocs=1
numprocs_start=0
redirect_stderr=true
stdout_logfile=%(here)s/../log/%(program_name)s-%(process_num)02d.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
startsecs=10
To check whether sdswatch-client is running, run these commands on the client side:
supervisorctl status
supervisorctl reread
supervisorctl update
If everything is set up correctly, running “supervisorctl status” again should give you the following:
(verdi) hysdsops@ip-100-67-32-182:~/verdi/etc$ supervisorctl status
sdswatch-client:sdswatch-client-00 RUNNING pid 17993, uptime 0:08:23
If it’s not running, you can run “supervisorctl start sdswatch-client:sdswatch-client-00”.
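If you want to automate this check, the status text can be parsed with a few lines of Python. This is a sketch that assumes the column layout shown in the sample output above (process name first, state second). The function name program_state is hypothetical.

```python
def program_state(status_output, process_name):
    """Return the state column for process_name, or None if it is not listed."""
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == process_name:
            return parts[1]
    return None


# Sample text copied from the supervisorctl status output above.
sample = "sdswatch-client:sdswatch-client-00 RUNNING pid 17993, uptime 0:08:23"
state = program_state(sample, "sdswatch-client:sdswatch-client-00")
```

On a real client, the input text could come from subprocess.run(["supervisorctl", "status"], capture_output=True, text=True).stdout.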
To check if the sdswatch-client is running correctly
On the client server, run tail -f /export/home/hysdsops/verdi/log/sdswatch-client-00.log. (The supervisor.d script above contains the line “stdout_logfile=%(here)s/../log/%(program_name)s-%(process_num)02d.log”, which is where Logstash directs its messages.) If the server hasn’t been started yet or there are no logs to ship, you should see 2 errors (this is ok):
Unable to retrieve license information from license server {:message=>"No Available connections"}
Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://elasticsearch:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://elasticsearch:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
If there are logs to ship, you should see a log similar to this:
{
    "source_id" => "user_rules_job",
    "host" => "https://100.67.33.56:15673",
    "metric_key" => "state",
    "log_path" => "/verdi/rabbitmq_queue_monitor_to_sdswatch-00.sdswatch.log",
    "sdswatch_timestamp" => 2020-05-29T02:30:18.000Z,
    "metric_value_float" => -1,
    "@version" => "1",
    "metric_value_string" => "running",
    "message" => "2020-05-29T02:30:18+00:00 , https://100.67.33.56:15673 , rabbitmq.queue , user_rules_job , state, running",
    "source_type" => "rabbitmq.queue"
}
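The sample event carries both metric_value_float and metric_value_string. One plausible reading, which is an assumption and not taken from the actual Logstash configuration, is that the raw value is always kept as a string and that -1 stands in as the float when the value is not numeric. A Python sketch of that reading, with the hypothetical function name split_metric_value:

```python
def split_metric_value(raw, placeholder=-1.0):
    """Return the float and string views of one raw metric value."""
    text = raw.strip()
    try:
        number = float(text)
    except ValueError:
        # Assumed behaviour: non-numeric values get a placeholder float.
        number = placeholder
    return {"metric_value_float": number, "metric_value_string": text}


state_fields = split_metric_value("running")
ready_fields = split_metric_value(" 51")
```

Under this reading, a metric whose real value is -1 would be indistinguishable from a non-numeric one in the float field, so the string field is the one to trust in that case.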
Next Step
Replace Logstash on the client side with Filebeat.
Move Logstash filtering code to server side
Test
On the server side, try to find a way to give read and write permission for the docker container to write into /export/home/hysdsops/tmp/sdswatch-server/data without using “chmod 777”
Install SDSWatch client on all compute nodes and test at scale.
Tips for migrating Logstash to Filebeat on client side
Understand how the client and server currently work first by playing around with it.
I recommend reading all the relevant files since there are not many (you should ignore all the Filebeat files on the client side for now) (tip: start with sdswatch-server.sh and sdswatch-client.sh). Try to understand the Logstash configuration file on the client side.
When migrating Logstash to Filebeat on the client side, the only thing you need to modify on the server side is the Logstash configuration (just adding a filter block between input block and output block).
I recommend trying out Filebeat on your local machine with SDSWatch logs first. Try using Filebeat to scoop up sdswatch logs and send them to the console. Then look at the output logs printed in the console and investigate the fields inside them. Compare them with the input that the current Logstash configuration in SDSWatch-client assumes. Try playing around with the Filebeat “add_field” feature and see if you can make the output logs from Filebeat carry the required information.
When you figure out how to make the output logs from Filebeat look right, check out the filebeat.yml and filebeat-configs that I wrote, which are currently in the system or on github, and use these files as a starting point. (Remember to always enable the “live reloading” feature so we can always update the configuration during production.)
I also already wrote a configuration file to create a docker container with Filebeat. You may want to log into hysdsops@100.67.32.182 and find the directory /export/home/hysdsops/verdi/share/filebeat-sdswatch-client. filebeat-sdswatch-client is similar to sdswatch-client, but it’s for Filebeat. However, when I ran it, there was an error whose cause I couldn’t figure out. This problem prevented me from migrating Logstash to Filebeat during my internship. Error when running Filebeat in docker: /usr/local/bin/docker-entrypoint: line 8: exec: filebeat: not found (I asked one of the people in Elastic forum, you may find it helpful)
The Filebeat-related files haven’t been tested yet, so take them with a grain of salt.
Resources you may find helpful
Logstash configuration: https://www.elastic.co/guide/en/logstash/current/configuration.html
Materials on environment variables in Logstash and Filebeat may be helpful.
Logstash directory layout on docker container: https://www.elastic.co/guide/en/logstash/current/dir-layout.html (you can find similar websites for Elasticsearch, Filebeat and Kibana)
Logstash file input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html
Logstash redis input: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-redis.html
Logstash filter: https://www.elastic.co/guide/en/logstash/current/filter-plugins.html
Basic filebeat configuration: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration.html
Don’t fall into these pitfalls when using Filebeat. https://logz.io/blog/filebeat-pitfalls/
Filebeat processor: https://www.elastic.co/guide/en/beats/filebeat/current/filtering-and-enhancing-data.html (similar to Logstash filter). This can be very helpful to adjust Filebeat output logs before sending it to Logstash on the server side.
Filebeat log input: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html
Docker materials on volume mounts, environment variables, and running commands inside a docker container may be helpful.
Materials on giving read and write permission to a docker container are very helpful.
Materials on saving and loading docker images are helpful when the compute node (e.g. verdi, factotum) doesn’t have access to the internet.