...

#PBS -l select=1:ncpus=28:model=bro


Setting hardware threads / cores for jobs

To complement the ncpus=N setting for PBS, you can also export the environment variable OMP_NUM_THREADS to control how many threads the job's processes use.

See https://www.nas.nasa.gov/hecc/support/kb/default-variables-set-by-pbs_189.html
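
For example, a minimal PBS script header for a worker job might pair the two settings as follows (the 28-core Broadwell request and the walltime value are only illustrative, echoing the select example above):

Code Block
#PBS -l select=1:ncpus=28:model=bro
#PBS -l walltime=02:00:00

# Match the OpenMP thread count to the ncpus requested above so the
# job does not oversubscribe the node's hardware threads.
export OMP_NUM_THREADS=28

# ... start the verdi job worker here ...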

Local RAM “drive” for faster scratch disk

On Pleiades, compute nodes do not have any on-board disk storage (similar to AWS's EBS-only EC2 instance types). Using a local RAM-backed scratch area instead of the NFS file system for the work directory will significantly improve performance.

PBS jobs also have the environment variable ${TMPDIR} available, which defaults to /tmp/pbs.job_id on the vnodes.

See https://www.nas.nasa.gov/hecc/support/kb/pbs-environment-variables_178.html
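
As a rough sketch (the staging layout and the destination path under /nobackup are hypothetical, not prescribed here), a PBS job could use ${TMPDIR} as its scratch work directory and copy anything that must persist back to NFS before exiting:

Code Block
# ${TMPDIR} defaults to /tmp/pbs.job_id on each vnode; since the compute
# nodes have no on-board disk, this scratch area is RAM-backed.
WORKDIR="${TMPDIR}/workdir"
mkdir -p "${WORKDIR}"
cd "${WORKDIR}"

# ... run the processing steps in ${WORKDIR} ...

# ${TMPDIR} is cleaned up when the PBS job ends, so copy results back
# to the NFS /nobackup area (hypothetical path) before the job exits.
cp -r "${WORKDIR}/output" "/nobackup/${USER}/results/"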

What happens to the job worker when PBS kills the job?

On a verdi job worker running as a PBS job, when PBS kills the job (e.g. when the maximum time limit is reached), the verdi worker exits gracefully. On the HySDS PCM Mozart/Figaro side, a WorkerLostError event is detected.

In the Pleiades verdi job worker log, we may see the following when PBS kills the job:

Code Block
[2020-03-21 19:50:52,151: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]:  cmdLine: /nasa/singularity/3.5.3/bin/singularity exec --no-home --home /home/ops --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/tasks:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/tasks --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/workers:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/workers --bind /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache:/nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache:ro --bind /home1/lpan/.netrc:/home/ops/.netrc:ro --bind /home1/lpan/.aws:/home/ops/.aws:ro --bind /home1/lpan/verdi/etc/settings.conf:/home/ops/ariamh/conf/settings.conf:ro --bind /home1/lpan/verdi/ops/hysds/celeryconfig.py:/celeryconfig.py:ro --bind /home1/lpan/verdi/etc/datasets.json:/datasets.json:ro --pwd /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs/2020/03/22/02/41/standard_product-s1gunw-topsapp-singularity__standard-product_singularity_singularity-S1-GUNW-ifg-cfg-RM-M1S3-TN121-20200104T001038-20191223T000947-poeorb-57b1-20200320T214325.678639Z /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache/container-leipan_ariamh_standard-product_singularity-2020-03-13-4c4f48280c76.simg /home/ops/ariamh/interferogram/sentinel/create_standard_product_s1.sh
[2020-03-21 19:50:52,153: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]: Pre-processing steps all signaled continuation.
worker: Warm shutdown (MainProcess)
[2020-03-21 21:13:53,899: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:37170 exited with 'signal 15 (SIGTERM)'
[2020-03-21 21:13:54,039: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 306, in create_loop
    events = poll(poll_timeout)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/apps/worker.py", line 284, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home1/lpan/verdi/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
zone:  PST8PDT
 -------------- celery@pleiades_worker.8356683.pbspl1.nas.nasa.gov v4.4.0rc3 (cliffs)
---- **** -----
--- * ***  * -- Linux-4.12.14-95.40.1.20191112-nasa-x86_64-with-SuSE-12-x86_64 2020-03-20 21:12:00
-- * - **** ---

...

Code Block
Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')


...