...
#PBS -l select=1:ncpus=28:model=bro
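For context, a directive like the one above normally sits at the top of a job script submitted with qsub. The sketch below is only illustrative; the wall time, queue name, job name, and worker start command are assumptions, not values taken from this page:

Code Block
#PBS -l select=1:ncpus=28:model=bro
#PBS -l walltime=2:00:00
#PBS -q normal
#PBS -N verdi-worker
# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
# placeholder: start the verdi job worker here
./start_verdi_worker.sh

The script is then submitted with qsub.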
...
Setting hardware threads / cores for jobs
To complement the ncpus=N setting for PBS, you can also export the OMP_NUM_THREADS environment variable.
See https://www.nas.nasa.gov/hecc/support/kb/default-variables-set-by-pbs_189.html for the default variables set by PBS.
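For example, a minimal sketch that requests 28 cores (model=bro) and sets a matching OpenMP thread count, assuming the PGE honors OMP_NUM_THREADS; NCPUS is one of the variables PBS sets on the execution host (see the KB link above), and the fallback value of 28 is just the ncpus requested here:

Code Block
#PBS -l select=1:ncpus=28:model=bro
# match the OpenMP thread count to the ncpus requested above;
# NCPUS is set by PBS on the execution host (see the KB link above)
export OMP_NUM_THREADS=${NCPUS:-28}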
...
...
...
see https://www.nas.nasa.gov/hecc/support/kb/pbs-environment-variables_178.html
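To check what PBS actually sets inside a running job, a quick sanity check (a sketch; the exact list of variables varies by system) can be run from the job script or an interactive PBS session:

Code Block
# list all PBS-provided environment variables on the execution host
env | grep '^PBS' | sort
# a few commonly used ones
echo "job id:     $PBS_JOBID"
echo "node file:  $PBS_NODEFILE"
echo "submit dir: $PBS_O_WORKDIR"
echo "scratch:    $TMPDIR"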
What happens to the job worker when PBS kills the job?
When PBS kills the job (e.g. when the maximum wall time limit is reached), the verdi job worker running as that PBS job will gracefully exit. On the HySDS PCM Mozart/Figaro side, a WorkerLostError event is detected.
In the Pleiades verdi job worker log, we may see the following when PBS kills the job:
Code Block
[2020-03-21 19:50:52,151: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]: cmdLine: /nasa/singularity/3.5.3/bin/singularity exec --no-home --home /home/ops --bind ... [additional --bind arguments truncated] ... --pwd /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/jobs/2020/03/22/02/41/standard_product-s1gunw-topsapp-singularity__standard-product_singularity_singularity-S1-GUNW-ifg-cfg-RM-M1S3-TN121-20200104T001038-20191223T000947-poeorb-57b1-20200320T214325.678639Z /nobackupp12/lpan/worker/workdir/2020/03/20/20200320T211158-pleiades_worker.8356683.pbspl1.nas.nasa.gov/cache/container-leipan_ariamh_standard-product_singularity-2020-03-13-4c4f48280c76.simg /home/ops/ariamh/interferogram/sentinel/create_standard_product_s1.sh
[2020-03-21 19:50:52,153: INFO/ForkPoolWorker-1] hysds.job_worker.run_job[f7aecba3-9fa6-4dd6-9886-a9e26ebc34c5]: Pre-processing steps all signaled continuation.

worker: Warm shutdown (MainProcess)

[2020-03-21 21:13:53,899: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:37170 exited with 'signal 15 (SIGTERM)'
[2020-03-21 21:13:54,039: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 306, in create_loop
    events = poll(poll_timeout)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home1/lpan/verdi/lib/python3.7/site-packages/celery/apps/worker.py", line 284, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home1/lpan/verdi/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
...
Code Block
Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
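To confirm that a missing job was terminated by PBS rather than by the PGE itself, the worker logs can be searched for this signature. The log location below is an assumption based on the verdi install paths shown above; adjust to wherever the worker log is actually written:

Code Block
# look for SIGTERM-induced worker loss in the verdi worker logs
grep -n "WorkerLostError" ~/verdi/log/*.log
grep -n "signal 15 (SIGTERM)" ~/verdi/log/*.log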
Local RAM “drive” for faster scratch disk
On Pleiades, each compute node does not have any on-board disk storage (similar to AWS’s EBS-only EC2 instance types), so local scratch space is backed by RAM. Using this local RAM “drive” instead of the NFS file system for the work dir will significantly improve performance.
PBS jobs also have the environment variable ${TMPDIR} available, which defaults to /tmp/pbs.job_id on the vnodes.
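A minimal sketch of using the RAM-backed ${TMPDIR} as the scratch work directory and copying results back to NFS before the job ends; the /nobackup destination is only an example path, and note that files written to ${TMPDIR} consume node memory and go away when the PBS job exits:

Code Block
# run the work out of the node-local, RAM-backed scratch area
cd $TMPDIR
# ... stage inputs and run the work dir / PGE here ...
# copy anything worth keeping back to NFS before the job ends,
# since /tmp/pbs.job_id is cleaned up at job exit
mkdir -p /nobackup/$USER/results/$PBS_JOBID
cp -r $TMPDIR/outputs /nobackup/$USER/results/$PBS_JOBID/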
...