To leverage the computing power of Pleiades, we port verdi to run bare-metal on Pleiades and have it communicate with the HySDS cluster (a.k.a. PCM) in the AWS cloud, obtaining jobs and submitting results via the queues between AWS and Pleiades. The Docker virtualization system is considered unsafe for the shared supercomputer environment and is therefore not supported on Pleiades, so verdi in its Docker form cannot run there directly. Furthermore, all PGEs packaged as Docker containers have to be converted to Singularity, another virtualization system that is allowed on Pleiades. The port of ARIA HySDS to Pleiades consists of two major parts: (1) The container-builder. The Jenkins pipeline running on the CI (continuous integration) server triggers the new container-builder, which, in addition to building each PGE as a Docker container, converts the Docker container into a Singularity container, builds a sandbox from the Singularity container, and uploads the Singularity sandbox to S3 storage. (2) The job_worker. The Singularity extension dequeues jobs from Mozart on AWS, downloads the PGE as a Singularity sandbox from S3, forms the singularity exec command line, runs the PGE, and submits the results to GRQ on AWS.
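As an illustration of the "forms the singularity exec command line" step, here is a minimal sketch of how such a command line could be assembled. The function name, the sandbox path, the bind mount, and the PGE command are all hypothetical placeholders; the actual job_worker code may build the command differently.

```python
import shlex


def build_singularity_exec(sandbox_dir, command, binds=None):
    """Return the argv list for running `command` inside a Singularity sandbox.

    `binds` maps host paths to container paths (passed as --bind options).
    """
    argv = ["singularity", "exec"]
    for host_path, container_path in (binds or {}).items():
        argv += ["--bind", f"{host_path}:{container_path}"]
    argv.append(sandbox_dir)
    argv += list(command)
    return argv


if __name__ == "__main__":
    # All paths below are placeholders, not real ARIA/HySDS locations.
    argv = build_singularity_exec(
        "/path/to/pge_sandbox",
        ["python", "run_pge.py"],
        binds={"/path/to/workdir": "/data/work"},
    )
    print(shlex.join(argv))
```

Building the command as a list (rather than a single string) avoids shell-quoting problems when the result is handed to `subprocess`.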

Running HySDS Verdi Job Worker on Pleiades

HySDS PCM in AWS can also control Verdi job workers running in an HECC cluster (e.g. Pleiades), in AWS, or on-premise. For Pleiades, the job workers can call home to PCM via VPN, ssh port tunnel, or AWS Transit Gateway.

...

In this design, verdi job workers call home via secure ssh remote port tunnels through a head node to PCM components for rabbitmq, redis, ES, and REST API calls. All other, larger traffic (e.g. AWS S3 transfers) is routed through a separate NAT head node.
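To make the tunnel idea concrete, the sketch below assembles an `ssh -N -R` command with one remote forward per PCM service. It is only an illustration: the host names are placeholders, the ports shown are the usual defaults for RabbitMQ (5672), Redis (6379), and Elasticsearch (9200) rather than values taken from this deployment, and which machine actually opens the tunnel is an assumption.

```python
def build_remote_tunnel_cmd(head_node, forwards, user=None):
    """Return an ssh argv that opens remote port forwards on `head_node`.

    `forwards` maps a listen port on the head node to a (host, port)
    destination reachable from the machine that runs ssh.
    """
    target = f"{user}@{head_node}" if user else head_node
    argv = ["ssh", "-N"]  # -N: do not run a remote command, only forward
    for listen_port, (dest_host, dest_port) in sorted(forwards.items()):
        argv += ["-R", f"{listen_port}:{dest_host}:{dest_port}"]
    argv.append(target)
    return argv


if __name__ == "__main__":
    # Hypothetical hosts; ports are service defaults, not confirmed settings.
    cmd = build_remote_tunnel_cmd(
        "head-node.example",
        {
            5672: ("mozart.example", 5672),  # rabbitmq
            6379: ("mozart.example", 6379),  # redis
            9200: ("grq.example", 9200),     # ES
        },
    )
    print(" ".join(cmd))
```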

Compute node types available on Pleiades HECC

source: https://www.nas.nasa.gov/hecc/support/kb/pbs-resource-request-examples_188.html

...

Code Block
pfe21% node_stats.sh

Node summary according to PBS:
 Nodes Available on Pleiades :  11137     cores: 239872
 Nodes Available on Aitken   :   1140     cores:  45600
 Nodes Available on Electra  :   3433     cores: 123532
 Nodes Available on Merope   :   1520     cores:  18240
 Nodes Down Across NAS       :    163

Nodes used/free by hardware type:
 SandyBridge cores/node:(16) Total:  1843, Used:  1628, Free:   215
 IvyBridge              (20) Total:  5204, Used:  5111, Free:    93
 Haswell                (24) Total:  2038, Used:  1959, Free:    79
 Broadwell              (28) Total:  1990, Used:  1794, Free:   196
 Broadwell (Electra)    (28) Total:  1149, Used:  1145, Free:     4
 Skylake   (Electra)    (40) Total:  2284, Used:  2250, Free:    34
 Cascadelake (Aitken)   (40) Total:  1140, Used:  1048, Free:    92

Nodes currently allocated to the gpu queue:
 Sandybridge                 Total:    62, Used:     3, Free:    59
 Skylake                     Total:    17, Used:    17, Free:     0

Nodes currently allocated to the devel queue:
 SandyBridge                 Total:    92, Used:    21, Free:    71
 IvyBridge                   Total:   732, Used:   687, Free:    45
 Haswell                     Total:   203, Used:   129, Free:    74
 Broadwell                   Total:   286, Used:   227, Free:    59
 Electra (Broadwell)         Total:     0, Used:     0, Free:     0
 Electra (Skylake)           Total:   325, Used:   325, Free:     0
 Aitken (Cascadelake)        Total:     0, Used:     0, Free:     0

Merope nodes used/free by hardware type:
 Westmere               (12) Total:  1520, Used:   727, Free:   793

Jobs on Pleiades are:
 requesting:     0 SandyBridge,  4712 IvyBridge,   812 Haswell,  2760 Broadwell,   275 Electra (B),  1558 Electra (S),  1629 Aitken (C) nodes

      using:  1628 SandyBridge,  5111 IvyBridge,  1959 Haswell,  1794 Broadwell,  1145 Electra (B),  2250 Electra (S),  1048 Aitken (C) nodes
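The per-hardware-type rows in this output can help decide which node model to request. The following sketch parses the "Total / Used / Free" rows that carry a cores-per-node figure and reports the type with the most free nodes. The regular expression is written against the sample output shown here; the real `node_stats.sh` format may change, so treat this as an assumption.

```python
import re

# Matches rows such as:
#   "SandyBridge cores/node:(16) Total:  1843, Used:  1628, Free:   215"
#   "Broadwell (Electra)    (28) Total:  1149, Used:  1145, Free:     4"
_ROW = re.compile(
    r"^(?P<name>.+?)\s*(?:cores/node:)?\((?P<cores>\d+)\)\s+"
    r"Total:\s*(?P<total>\d+),\s*Used:\s*(?P<used>\d+),\s*Free:\s*(?P<free>\d+)"
)


def parse_node_stats(text):
    """Return one dict per hardware-type row found in node_stats.sh output."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            rows.append({
                "name": " ".join(match["name"].split()),
                "cores": int(match["cores"]),
                "total": int(match["total"]),
                "used": int(match["used"]),
                "free": int(match["free"]),
            })
    return rows


def most_free(rows):
    """Return the row with the most free nodes, or None if there are no rows."""
    return max(rows, key=lambda row: row["free"]) if rows else None
```

Rows for the gpu and devel queues have no cores-per-node column and are deliberately skipped by this pattern.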

Queues for Pleiades, Aitken, and Electra Users

source: https://www.nas.nasa.gov/hecc/support/kb/pbs-job-queue-structure_187.html

...

Code Block
Queue   NCPUs/      Time/
name      max/def    max/def    pr
======= =====/=== ======/====== ===
low        --/  8  04:00/ 00:30 -10
normal     --/  8  08:00/ 01:00   0
long       --/  8 120:00/ 01:00   0
debug      --/  8  02:00/ 00:30  15
devel      --/  1  02:00/    -- 149

Running Jobs Before Dedicated Time

source: https://www.nas.nasa.gov/hecc/support/kb/running-jobs-before-dedicated-time_306.html

...

In other words, if Pleiades can fit 90-minute jobs, go ahead and dispatch our PBS jobs for job_worker-singularity.sh.

CLI to PBS

Quickly delete all running+queued jobs

qstat -r -q hysds > qstat.txt
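The command above only saves the job listing to `qstat.txt`. One possible follow-up, sketched here, extracts the job IDs from that file and builds a single `qdel` command. It assumes each job row starts with an ID of the form `<digits>.<server>`; the sample text in the usage example is made up, not real output. Note also that `qstat -r` lists running jobs, so queued jobs would need a separate listing.

```python
import re

# Assumed job-ID shape at the start of a qstat row, e.g. "1234567.server".
_JOB_ID = re.compile(r"^(\d+\.\S+)")


def job_ids_from_qstat(text):
    """Return the job IDs found at the start of lines in qstat output."""
    ids = []
    for line in text.splitlines():
        match = _JOB_ID.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


def build_qdel_cmd(job_ids):
    """Return a qdel argv for the given job IDs, or [] if there is nothing to delete."""
    return ["qdel", *job_ids] if job_ids else []


if __name__ == "__main__":
    # Made-up listing for illustration only.
    listing = "Job ID          User  Queue\n1111111.srv1    usr   hysds\n2222222.srv1    usr   hysds\n"
    print(build_qdel_cmd(job_ids_from_qstat(listing)))
```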

PBS script

#PBS -l select=xx:ncpus=yy:model=zz

...

For Sandy Bridge
#PBS -l select=15:ncpus=16:mpiprocs=16:model=san

Running one verdi per compute node with shrink-to-fit

Use select=1 for one node per verdi job, but submit many such jobs with shrink-to-fit.

...

#PBS -l select=1:ncpus=28:model=bro
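The directives in this section follow a regular pattern, so they can be generated. The sketch below is a hypothetical helper that formats the `select` line for the two models shown here (`san` with 16 cores, `bro` with 28) and, separately, a shrink-to-fit line. The `min_walltime`/`max_walltime` resources are how PBS Professional expresses shrink-to-fit jobs; the walltime values used below are illustrative, not the project's settings.

```python
# Cores per node for the models that appear in this page.
CORES_PER_NODE = {"san": 16, "bro": 28}


def pbs_select_line(model, nodes=1, ncpus=None):
    """Format a '#PBS -l select=...' directive for whole nodes of `model`."""
    if ncpus is None:
        ncpus = CORES_PER_NODE[model]
    return f"#PBS -l select={nodes}:ncpus={ncpus}:model={model}"


def pbs_shrink_to_fit_line(min_walltime, max_walltime):
    """Format a shrink-to-fit walltime directive (hh:mm:ss strings)."""
    return f"#PBS -l min_walltime={min_walltime},max_walltime={max_walltime}"


if __name__ == "__main__":
    print(pbs_select_line("bro"))
    # Illustrative walltime bounds only.
    print(pbs_shrink_to_fit_line("01:30:00", "08:00:00"))
```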

Setting hardware threads / cores for jobs

To complement the ncpus=N setting for PBS, you can also export the environment variable OMP_NUM_THREADS.

see https://www.nas.nasa.gov/hecc/support/kb/default-variables-set-by-pbs_189.html
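When the worker launches a PGE from Python, the same setting can be applied per child process rather than exported globally. This is a minimal sketch with a hypothetical helper name; it simply copies the environment and sets `OMP_NUM_THREADS` to the requested core count.

```python
import os


def env_with_omp_threads(ncpus, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set to `ncpus`."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(ncpus)
    return env
```

The result can be passed as `env=` to `subprocess.run` or `subprocess.Popen`, which leaves the worker's own environment untouched.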

Local RAM “drive” for faster scratch disk

On Pleiades, compute nodes do not have any on-board disk storage (similar to AWS’s EBS-only EC2 instance types). Using a local RAM drive instead of the NFS file system for the work dir will significantly improve performance.

...

https://www.nas.nasa.gov/hecc/support/kb/pbs-environment-variables_178.html

Enable “auto-scaling”-like behavior with PBS

set_desire_worker.sh

mimics scale-out (scale up)

...

reference: https://www.nas.nasa.gov/hecc/support/kb/commonly-used-pbs-commands_174.html
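The scale-out decision itself reduces to simple arithmetic: compare the desired number of workers with the PBS jobs that are already running or queued, and submit the difference. The sketch below shows that calculation only. It is not the contents of set_desire_worker.sh, and the function name is hypothetical.

```python
def jobs_to_submit(desired, running, queued):
    """Return how many new PBS worker jobs are needed to reach `desired`.

    Never returns a negative number: this covers scale-out only.
    """
    return max(0, desired - (running + queued))
```

For example, with 10 workers desired, 3 running, and 2 queued, 5 more jobs would be submitted; if the current total already meets or exceeds the target, nothing is submitted.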

...

Short-circuiting job workers if not enough time remains to complete the next job

TODO: Lei.Pan@jpl.nasa.gov
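One possible shape for this check, as a sketch only: compare the walltime left in the PBS job with the expected duration of the next job plus a safety margin. The function name, the margin default, and how the expected duration would be estimated are all assumptions.

```python
def has_time_for_next_job(walltime_limit_s, elapsed_s, expected_job_s, margin_s=300):
    """Return True if the remaining walltime covers the next job plus a margin.

    All arguments are in seconds. `margin_s` leaves room for staging
    results and a clean shutdown before PBS ends the job.
    """
    remaining_s = walltime_limit_s - elapsed_s
    return remaining_s >= expected_job_s + margin_s
```

A worker would call this before dequeuing each job and exit early when it returns False, so that no job is started that PBS would kill partway through.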

Auto-exit of verdi job workers--harikiri_pid

see: https://jira.jpl.nasa.gov/projects/ARIA/issues/ARIA-291?filter=doneissues

Adapted harikiri to detect done workers on Pleiades

This should be based on adapting the existing harikiri agent, which runs as a thread in each verdi job worker script, determines when the job worker has no more jobs, and then kills the job worker process.

Could be as simple as a PBS job worker script that does the following:

run the job worker script in background mode
save its PID
run harikiri (blocks); harikiri detects that there are no more jobs after a 10-min wait, then kills the background job worker PID with SIGTERM
exit, so that the PBS job exits
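The steps above can be sketched as follows. This is not the harikiri implementation: the function name is hypothetical, and `is_idle` stands in for whatever check decides that the worker has no more jobs. The 600-second default mirrors the 10-minute wait described above.

```python
import signal
import subprocess
import time


def run_worker_with_watchdog(worker_cmd, is_idle, idle_limit_s=600, poll_s=10):
    """Start the worker, then block until it exits or has been idle too long.

    `is_idle` is a caller-supplied callable returning True when the worker
    has no job to run. Once idleness has lasted `idle_limit_s` seconds the
    worker is sent SIGTERM. Returns the worker's exit code.
    """
    worker = subprocess.Popen(worker_cmd)  # run in background; PID is worker.pid
    idle_since = None
    while worker.poll() is None:
        if is_idle():
            if idle_since is None:
                idle_since = time.monotonic()
            if time.monotonic() - idle_since >= idle_limit_s:
                worker.send_signal(signal.SIGTERM)
                break
        else:
            idle_since = None  # worker picked up a job; reset the idle timer
        time.sleep(poll_s)
    return worker.wait()
```

When this function returns, the calling PBS script can simply exit, which ends the PBS job and releases the node.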

Scale-down is handled inside job_worker-singularity via harikari-pid.py.

see https://github.com/hysds/job_worker-singularity

What happens to the job worker when PBS kills the job?

When PBS kills a verdi job worker that is running as a PBS job (e.g. when the max time limit is reached), the verdi worker exits gracefully. On HySDS PCM Mozart/figaro, a WorkerLostError event is detected.
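For wrapper code around the worker, a graceful exit usually means catching the termination signal and finishing cleanup before leaving. The sketch below assumes PBS delivers SIGTERM before forcibly killing the job; the class name is hypothetical and this is not how verdi itself is implemented.

```python
import signal


class GracefulExit:
    """Record that a SIGTERM arrived so a main loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the main loop does the actual cleanup.
        self.requested = True
```

A main loop would call `install()` once at startup and check `requested` between units of work, stopping and cleaning up when it becomes True.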

...