This article outlines the nominal data flow of a job through HySDS.
Submit a Job
A job is submitted via On-Demand or a Trigger Rule. Once submitted, it moves to job-queued status.
Job Queued
A queued job is checked to see whether it is running. If it is, the job is moved to job-started status. If not, the job is revoked and changed to job-revoked status. Revoked jobs are then deleted.
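The queued-job check can be sketched as a small transition function. This is only an illustration of the rule described above, not HySDS code: the `JobStatus` enum, the `resolve_queued` function, and the `is_running` flag are hypothetical names, and only the status strings come from this article.

```python
from enum import Enum


class JobStatus(Enum):
    """Status strings taken from this article; the enum itself is hypothetical."""
    QUEUED = "job-queued"
    STARTED = "job-started"
    REVOKED = "job-revoked"


def resolve_queued(is_running: bool) -> JobStatus:
    """Return the next status for a job that is currently job-queued.

    `is_running` stands in for whatever check the system performs;
    how that check is implemented is not specified here.
    """
    return JobStatus.STARTED if is_running else JobStatus.REVOKED


if __name__ == "__main__":
    for running in (True, False):
        print(running, "->", resolve_queued(running).value)
```

A job that resolves to job-revoked would then be deleted, as described above.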
Job Started
Running jobs that have been moved to job-started status are checked for two conditions: whether they have timed out (via the watchdog check timedout) and whether they have succeeded (via the exit code).
If timedout: Jobs that have timed out are changed to job-offline status and then deleted.
If succeeded with exit code == 0: Successfully completed jobs are updated to job-completed status. Finally, completed jobs are deleted.
If exit code != 0: Jobs with non-zero exit codes are treated as failed jobs. Their status is updated to job-failed, and they are then deleted.
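The three outcomes for a started job can be summarized in a short sketch. The function name `resolve_started` and its parameters are hypothetical, and checking the timeout before the exit code is an assumption made for illustration; the article does not state the order in which the two conditions are evaluated.

```python
from typing import Optional


def resolve_started(timed_out: bool, exit_code: Optional[int]) -> str:
    """Return the next status for a job that is currently job-started.

    timed_out: result of the watchdog timeout check (hypothetical flag).
    exit_code: exit code of the job, or None if it has not exited yet.
    """
    if timed_out:
        # Assumption: the timeout check takes precedence over the exit code.
        return "job-offline"
    if exit_code is None:
        # Still running and not timed out: no transition yet.
        return "job-started"
    return "job-completed" if exit_code == 0 else "job-failed"


if __name__ == "__main__":
    print(resolve_started(True, None))   # timed out
    print(resolve_started(False, 0))     # clean exit
    print(resolve_started(False, 1))     # non-zero exit
```

In all three terminal cases (job-offline, job-completed, job-failed), the job is subsequently deleted.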
Job Deduped
Jobs whose parameters are identical to those of another job, or that duplicate a job that has already completed successfully, are deduped and no further processing occurs.
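One common way to detect identical parameters is to hash a canonical serialization of the job and compare it against keys already seen. The sketch below shows that general technique only; it is not taken from HySDS, and `dedup_key`, `is_duplicate`, the `job_type` argument, and the in-memory `seen` set are all hypothetical.

```python
import hashlib
import json


def dedup_key(job_type: str, params: dict) -> str:
    """Build a stable key from the job type and its parameters.

    sort_keys=True makes the serialization independent of dict ordering,
    so the same parameters always hash to the same key.
    """
    canonical = json.dumps({"type": job_type, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_duplicate(key: str, seen: set) -> bool:
    """Return True if the key was already recorded; otherwise record it.

    `seen` stands in for whatever store tracks active or completed jobs.
    """
    if key in seen:
        return True
    seen.add(key)
    return False


if __name__ == "__main__":
    seen = set()
    first = dedup_key("example-job", {"input": "a", "version": 1})
    second = dedup_key("example-job", {"version": 1, "input": "a"})
    print(is_duplicate(first, seen))   # first submission
    print(is_duplicate(second, seen))  # same parameters, different order
```

A real deployment would need a shared, persistent store rather than a Python set, since submissions can arrive from more than one process.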
Tracking a Job
Jobs can be tracked through the job lifecycle via the payload ID. In Figaro, operators can facet on a job’s unique ID to monitor its progression through the various stages and to troubleshoot problems.
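Conceptually, faceting on an ID amounts to filtering status records by that ID and reading them in time order. The sketch below illustrates the idea on an in-memory list; it does not use Figaro or its backing store, and the record fields `payload_id`, `status`, and `timestamp`, as well as the function `job_history`, are assumed names for illustration.

```python
from typing import Dict, List


def job_history(records: List[Dict], payload_id: str) -> List[str]:
    """Return the statuses recorded for one payload ID, oldest first."""
    matching = [r for r in records if r.get("payload_id") == payload_id]
    matching.sort(key=lambda r: r["timestamp"])
    return [r["status"] for r in matching]


if __name__ == "__main__":
    # Hypothetical records; ISO 8601 timestamps sort correctly as strings.
    records = [
        {"payload_id": "abc", "status": "job-started", "timestamp": "2020-01-01T00:05:00Z"},
        {"payload_id": "xyz", "status": "job-queued", "timestamp": "2020-01-01T00:01:00Z"},
        {"payload_id": "abc", "status": "job-queued", "timestamp": "2020-01-01T00:00:00Z"},
        {"payload_id": "abc", "status": "job-completed", "timestamp": "2020-01-01T00:20:00Z"},
    ]
    print(job_history(records, "abc"))
```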
Job Completed
The PGE completed successfully.
Job Started
A worker node has started processing a job.
Job Offline
The worker node processing the job is offline.