Running HySDS Verdi Job Worker on Pleiades
HySDS PCM in AWS can also be used to control Verdi job workers in HECC cluster (e.g. Pleiades), AWS, on-premise. For Pleiades, the job workers can call home to PCM via VPN, ssh port tunnel, VPN, or AWS Transit Gateway.
The control of compute nodes in Pleiades is designed to parallel auto-scaling approach in AWS. We separate the queue from scaling logic, auto-scaler that sets desired nodes, from auto-scaling service, from the infrastructure itself. Here we use PBS as analogous to the auto-scaling service.
...
The verdi job workers on Pleiades then can call back to PCM in AWS via secured tunnels using direct ssh passthrough as suggested by NAS. https://www.nas.nasa.gov/hecc/support/kb/setting-up-ssh-passthrough_232.html
...
In this design, verdi job workers call home via secure ssh remote port tunnels through a head node to PCM components for rabbitmq, redis, ES, and REST API calls. But for all other larger traffic (e.g. AWS S3 transfer), the transfers are routed to a separate NAT head node
Compute node types available on Pleiades HECC
source: https://www.nas.nasa.gov/hecc/support/kb/pbs-resource-request-examples_188.html
Processor Type | SBU Rate/Node | Memory/Node |
Cascade Lake | 1.64 (40 cores/node) | 192 GB |
Skylake | 1.59 (40 cores/node) | 192 GB |
Broadwell | 1.00 (28 cores/node) | 128 GB |
Haswell | 0.80 (24 cores/node) | 128 GB |
Ivy Bridge | 0.66 (20 cores/node) | 64 GB |
Sandy Bridge | 0.47 (16 cores/node) | 32 GB |
...
Confidence Level High This article been formally reviewed and is signed off on by a relevant subject matter expert. |
---|
Intro
To leverage the computing power of pleiades, we port verdi to run bare-metal on pleiades and have it communicate with the HySDS cluster (a.k.a. PCM) on the Amazon Cloud to obtain jobs and submit results via the queues between the AWS Cloud and pleiades. The Docker virtualization system is considered unsafe for the shared supercomputer environment and therefore is not supported on pleiades. So verdi in its Docker form cannot be run directly on pleiades. Furthermore, all the PGEs in Docker containers have to be converted to singularity, another virtualization system that is allowed on pleiades. The porting of ARIA HySDS to pleiades consists of two major parts: (1) The container-builder. The Jenkins pipeline running on the ci (continuous integration) server triggers the new container-builder to, in addition to building the PGEs in the form of Docker containers, convert the Docker container into a singularity container, build a sandbox from the singularity container, and upload the singularity sandbox to the S3 storage; and (2) job_worker. The singularity extension dequeues jobs from Mozart on AWS, downloads the PGE in the form of a singularity sandbox from S3, forms the singularity exec command line, runs the PGE, and submit the results to GRQ on AWS.
Running HySDS Verdi Job Worker on Pleiades
HySDS PCM in AWS can also be used to control Verdi job workers in HECC cluster (e.g. Pleiades), AWS, on-premise. For Pleiades, the job workers can call home to PCM via VPN, ssh port tunnel, VPN, or AWS Transit Gateway.
The control of compute nodes in Pleiades is designed to parallel auto-scaling approach in AWS. We separate the queue from scaling logic, auto-scaler that sets desired nodes, from auto-scaling service, from the infrastructure itself. Here we use PBS as analogous to the auto-scaling service.
The verdi job workers on Pleiades then can call back to PCM in AWS via secured tunnels using direct ssh passthrough as suggested by NAS. https://www.nas.nasa.gov/hecc/support/kb/setting-up-ssh-passthrough_232.html
In this design, verdi job workers call home via secure ssh remote port tunnels through a head node to PCM components for rabbitmq, redis, ES, and REST API calls. But for all other larger traffic (e.g. AWS S3 transfer), the transfers are routed to a separate NAT head node
Verdi runs Singularity containers on Pleiades due to security restrictions on Pleiades not allowing Docker. see https://hysds-core.atlassian.net/wiki/spaces/HYS/pages/199852516/Job+worker-singularity
Compute node types available on Pleiades HECC
source: https://www.nas.nasa.gov/hecc/support/kb/pbs-resource-request-examples_188.html
Processor Type | SBU Rate/Node | Memory/Node |
Cascade Lake | 1.64 (40 cores/node) | 192 GB |
Skylake | 1.59 (40 cores/node) | 192 GB |
Broadwell | 1.00 (28 cores/node) | 128 GB |
Haswell | 0.80 (24 cores/node) | 128 GB |
Ivy Bridge | 0.66 (20 cores/node) | 64 GB |
Sandy Bridge | 0.47 (16 cores/node) | 32 GB |
Code Block |
---|
pfe21% node_stats.sh Node summary according to PBS: Nodes Available on Pleiades : 11137 cores: 239872 Nodes Available on Aitken : 1140 cores: 45600 Nodes Available on Electra : 3433 cores: 123532 Nodes Available on Merope : 1520 cores: 18240 Nodes Down Across NAS : 163 Nodes used/free by hardware type: SandyBridge cores/node:(16) Total: 1843, Used: 1628, Free: 215 IvyBridge (20) Total: 5204, Used: 5111, Free: 93 Haswell (24) Total: 2038, Used: 1959, Free: 79 Broadwell (28) Total: |
...
1990, Used: |
...
1794, Free: 196 |
...
Broadwell |
...
(Electra) (28) Total: 1149, Used: 1145, Free: 4 Skylake (Electra) (40) Total: 2284, Used: |
...
2250, Free: 34 Cascadelake (Aitken) (40) Total: 1140, Used: |
...
1048, Free: |
...
92 Nodes currently allocated to the |
...
gpu queue: |
...
Sandybridge Total: |
...
62, Used: |
...
3, Free: |
...
59 Skylake |
...
Total: |
...
17, Used: |
...
17, Free: |
...
0 Nodes currently allocated to the devel queue: SandyBridge Total: |
...
92, Used: |
...
21, Free: |
...
71 |
...
IvyBridge Total: |
...
732, Used: |
...
687, Free: |
...
45 Haswell |
...
|
...
Total: |
...
203, Used: |
...
129, Free: 74 |
...
Broadwell |
...
|
...
Total: |
...
286, Used: |
...
227, Free: |
...
59 |
...
Electra ( |
...
Broadwell) Total: 0, Used: 0, Free: 0 |
...
Electra (Skylake) Total: |
...
325, Used: 325, Free: 0 |
...
Aitken ( |
...
Cascadelake) Total: |
...
0, Used: |
...
0, Free: |
...
0 Merope nodes used/free by hardware type: Westmere (12) Total: 1520, Used: 727, Free: 793 Jobs on Pleiades are: requesting: 0 SandyBridge, 4712 IvyBridge, 812 Haswell, 2760 Broadwell, 275 Electra (B), 1558 Electra (S), 1629 Aitken (C) nodes using: 1628 SandyBridge, 5111 IvyBridge, 1959 Haswell, 1794 Broadwell, 1145 Electra (B), 2250 Electra (S), 1048 Aitken (C) nodes |
Queues for Pleiades, Aitken, and Electra Users
source: https://www.nas.nasa.gov/hecc/support/kb/pbs-job-queue-structure_187.html
qstat -Q
Code Block |
---|
Queue NCPUs/ Time/
name max/def max/def pr
======= =====/=== ======/====== ===
low --/ 8 04:00/ 00:30 -10
normal --/ 8 08:00/ 01:00 0
long --/ 8 120:00/ 01:00 0
debug --/ 8 02:00/ 00:30 15
devel --/ 1 02:00/ -- 149 |
Running Jobs Before Dedicated Time
source: https://www.nas.nasa.gov/hecc/support/kb/running-jobs-before-dedicated-time_306.html
The PBS batch scheduler supports a feature called shrink-to-fit (STF)
qsub -l min_walltime=6:00:00,max_walltime=168:00:00 job_worker-singularity.sh
This says if the Pleiades can fit 90-minute jobs, then go ahead and dispatch our PBS jobs for job_worker-singularity.sh
CLI to PBS
...
Quickly delete all running+queued jobs
qstat -r -q hysds > qstat.txt
PBS script
#PBS -l select=xx:ncpus=yy:model=zz
For Broadwell
#PBS -l select=10:ncpus=28:model=bro
For Haswell#PBS -l select=10:ncpus=24:mpiprocs=24:model=has
For Ivy Bridge#PBS -l select=12:ncpus=20:mpiprocs=20:model=ivy
For Sandy Bridge#PBS -l select=15:ncpus=16:mpiprocs=16:model=san
Running one verdi per compute node with shrink-to-fit
use select=1
for 1-node per verdi job, but submit many with shrink-to-fit.
qsub -l min_walltime=2:00:00,max_walltime=168:00:00 job_worker-singularity.sh
This requests Pleiades to schedule verdi if have even 2-hours available. That would be enough time to run one nominal processing job.
where the job_worker-singularity.sh contains:
#PBS -l select=1:ncpus=28:model=bro
What happens to the job worker when PBS kills the job?
On the verdi job worker that is running as a PBS job, when PBS kills the job (e.g. when the max time limit is reached), the verdi worker will gracefully exit. On HySDS PCM Mozrt/figaro, a WorkerLostError event is detected.
On the Pleiades verdi job worker log, we may see the following when PBS kills the job:
...
Setting hardware threads / cores for jobs
To complement the ncpu:N setting for PBS, can also export the environment variable OMP_NUM_THREADS
see https://www.nas.nasa.gov/hecc/support/kb/default-variables-set-by-pbs_189.html
Local RAM “drive” for faster scratch disk
On Pleiades, each compute node does not have any on-board disk storage (sImilar to AWS’s EBS-only EC2 instance types). Using this instead of NFS file system for work dir will significantly improve performance.
PBS jobs also have an available environment variable ${TMPDIR}
in PBS job, which defaults to /tmp/pbs.job_id
on the vnodes.
https://www.nas.nasa.gov/hecc/support/kb/pbs-environment-variables_178.html
NAS currently allocates about 50%-60% of node’s RAM to this ram drive. (it is observed that it varies by cluster)
Enable “auto-scaling”-like behavior with PBS
set_desire_worker.sh
mimics scale-out (scale up)
Code Block |
---|
#!/usr/bin/env bash
DESIRED=$1echo "# DESIRED: ${DESIRED}"
while true; doTIMESTAMP=$(date +%Y%m%dT%H%M%S)echo "$(date) checking qstat on hysds queue..."
# get count of running and queue jobs
TOKENS=$(qstat -q hysds | awk '{if ($1=="hysds") print $6 " " $7}')
IFS=" " read RUNNING QUEUED <<< ${TOKENS}
echo "# RUNNING: ${RUNNING}"
echo "# QUEUED: ${QUEUED}"
RUNNING_QUEUED=$((RUNNING + QUEUED))
echo "# RUNNING_QUEUED: ${RUNNING_QUEUED}"
if [ "${RUNNING_QUEUED}" -lt "${DESIRED}" ]; then
echo "# ---> qsub one more job..."
qsub -q hysds celery.pbs
fi
echo ""
sleep 60
done |
reference: https://www.nas.nasa.gov/hecc/support/kb/commonly-used-pbs-commands_174.html
Short-circuiting job worker if not enough time remaining to completed the next job
Recommendations for running on Pleiades:
Make fairly long time job requests
Short-circuiting job workers if not enough time remaining to completed the next jobs
At the beginning of a PBS job wrapper script (invoked by verdi job worker), check if the job has sufficient time allocation remaining to complete the max estimated duration of a job. This ensures that the running job has at least X hours left in the PBS job for the node running the job, and exiting the verdi job worker if the PBS job does not have sufficient time remaining to complete the estimated job duration.
Inside job worker, before the start of each job, delegate to check login on Pleiades if have enough time for at least the next job duration (2-hours). Can run qstat -f $PBS_JOBID and compare the output for resources_used.walltime and Resources_List.walltime to see how much time remains. If insufficient time, then the pbs wrapper script can sigterm gracefully the PID of verdi job worker to trigger verdi job worker to exit the process and therefore end the PBS node. have job worker gracefully exit.
Based on current metrics, topsapp-singularity generating one S1-GUNW takes a mean of 1.5 hours on Broadwell. So add some margin and use that as max. e.g. requiring at least 2-hours remaining for current PBS job. if not, then exit the PBS job.
Lei.Pan@jpl.nasa.gov (Unlicensed) TODO
Auto-exit of verdi job workers--harikiri_pid
see: https://jira.jpl.nasa.gov/projects/ARIA/issues/ARIA-291?filter=doneissues
Adapted harikiri to detect done workers on pleiades
This should be based on adapting existing harikri agent that runs as thread in each verdi job worker script and determines when job worker has not more jobs and then kills the job worker process.
Could be as simple as pbs job worker script:
run job worker script in background mode
save its PID
run hariki (blocks) # hariki detects for no more jobs after 10-min wait, then sigterm kills PID of job worker running in background.exit for PBS job to exit.
scale-down being handled inside job_worker-singularity via harikari-pid.py.
see https://github.com/hysds/job_worker-singularity
What happens to the job worker when PBS kills the job?
On the verdi job worker that is running as a PBS job, when PBS kills the job (e.g. when the max time limit is reached), the verdi worker will gracefully exit. On HySDS PCM Mozrt/figaro, a WorkerLostError event is detected.
On the Pleiades verdi job worker log, we may see the following when PBS kills the job:
Code Block |
---|
[2020-03-21 19:50:52,151: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]: cmdLine: /nasa/singularity/3.5.3/bin/singularity exec --no-home --home /home/ops --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/tasks:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/tasks --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/workers:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/workers --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov |
...
The two relevant log entries are:
Code Block |
---|
worker: Warm shutdown (MainProcess) |
Code Block |
---|
Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).') |
Local RAM “drive” for faster scratch disk
On Pleiades, each compute node does not have any on-board disk storage (sImilar to AWS’s EBS-only EC2 instance types). Using this instead of NFS file system for work dir will significantly improve performance.
PBS jobs also have an available environment variable ${TMPDIR}
in PBS job, which defaults to /tmp/pbs.job_id
on the vnodes.
...
/cache:ro --bind /home1/lpan/.netrc:/home/ops/.netrc:ro --bind /home1/lpan/.aws:/home/ops/.aws:ro --bind /home1/lpan/verdi/etc/settings.conf:/home/ops/ariamh/conf/settings.conf:ro --bind /home1/lpan/verdi/ops/hysds/celeryconfig.py:/celeryconfig.py:ro --bind /home1/lpan/verdi/etc/datasets.json:/datasets.json:ro --pwd /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs/2020/03/22/02/41/standard_product-s1gunw-topsapp-singularity__standard-product_singularity_singularity-S1-GUNW-ifg-cfg-RM-M1S3-TN121-20200104T001038-20191223T000947-poeorb-57b1-20200320T214325.678639Z /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache/container-leipan_ariamh_standard-product_singularity-2020-03-13-4c4f48280c76.simg /home/ops/ariamh/interferogram/sentinel/create_standard_product_s1.sh
[2020-03-21 19:50:52,153: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]: Pre-processing steps all signaled continuation.worker: Warm shutdown (MainProcess)
[2020-03-21 21:13:53,899: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:37170 exited with 'signal 15 (SIGTERM)'
[2020-03-21 21:13:54,039: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 306, in create_loop
events = poll(poll_timeout)
File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
return self._epoll.poll(timeout if timeout is not None else -1)
File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/apps/worker.py", line 284, in _handle_request
raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0During handling of the above exception, another exception occurred:Traceback (most recent call last):
File "/home1/lpan/verdi/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
zone: PST8PDT -------------- celery@pleiades_worker.8356683.pbspl1.nas.nasa.gov v4.4.0rc3 (cliffs)
---- **** -----
--- * *** * -- Linux-4.12.14-95.40.1.20191112-nasa-x86_64-with-SuSE-12-x86_64 2020-03-20 21:12:00
-- * - **** --- |
The two relevant log entries are:
Code Block |
---|
worker: Warm shutdown (MainProcess) |
Code Block |
---|
Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).') |
Adaptations of HySDS/Pleiades
ARIA HySDS running Sentinel-1A/B S1-GUNW processing on Pleiades