Generic Trigger Rules for Mozart failed jobs
Confidence Level: Moderate. This article includes input from several JPLers; pages with multiple subject matter experts tend to be reviewed and updated more frequently.
Attached is a reference set of generic Figaro trigger rules in JSON (mainly for AWS) that adds robustness by handling various failed-job scenarios. It employs the rules listed below; a hedged sketch of what a single rule entry might look like follows the table.
Generic Trigger Rules
The following generic trigger rules add resiliency in Mozart by handling common failed-job scenarios.
Name | Condition | Action | Notes |
---|---|---|---|
retry-failed-client_error | All job-failed with string match “Client Error” | hysds-io-lw-mozart-retry | |
retry-failed-SoftTimeLimitExceeded-exception | All job-failed whose Mozart job exception field contains SoftTimeLimitExceeded() | hysds-io-lw-mozart-retry | |
retry failed could not connect to endpoint url | All job-failed with any error containing “Could not connect to the endpoint URL” | hysds-io-lw-mozart-retry | |
retry failed to download | All job-failed with any error containing “Failed to download” | hysds-io-lw-mozart-retry | |
retry-failed-generic_non_zero_exit_code_1 | All job-failed with string match “Got non-zero exit code: 1” | hysds-io-lw-mozart-retry | |
retry-failed-server_error | All job-failed with string match “Server Error” | hysds-io-lw-mozart-retry | |
retry-failed-SoftTimeLimitExceeded-query | All job-failed with query string match containing “SoftTimeLimitExceeded” | hysds-io-lw-mozart-retry | |
retry-failed-CalledProcessError | All job-failed with string match “CalledProcessError” | hysds-io-lw-mozart-retry | |
retry-failed-nonzero_exit_code_125 | All job-failed with string match “Got non-zero exit code: 125” | hysds-io-lw-mozart-retry | |
retry-failed-too_many_requests_for_url | All job-failed with string match “Too Many Requests for url” | hysds-io-lw-mozart-retry | |
retry-failed-exit_code_143 | All job-failed with string match “Got non-zero exit code: 143” | hysds-io-lw-mozart-retry | This condition may be caused by incorrect configuration and use of Docker on smaller node instance types. |
retry-job-offline | All job-offline | hysds-io-lw-mozart-retry | This condition occurs during low-level infrastructure failures, such as when the network goes down; workers time out and are marked offline by the system. This trigger rule retries the offline job. |
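For reference, a single entry in such a trigger rules JSON might look roughly like the sketch below, rendered here as a Python dictionary. This is not the authoritative Mozart/Figaro schema: the field names (`rule_name`, `query_string`, `job_type`, `queue`, `priority`, `kwargs`, `enabled`), the job-status field values, and the queue name are assumptions for illustration; verify them against the attached JSON and your own Mozart deployment.

```python
import json

# Illustrative sketch only: the field names below are assumptions, not the
# authoritative Mozart/Figaro user-rule schema. Check the attached JSON and
# your Mozart deployment for the exact structure.
retry_exit_code_143_rule = {
    "rule_name": "retry-failed-exit_code_143",
    # Condition: an Elasticsearch-style query over job-status documents that
    # matches failed jobs whose error text contains the given phrase.
    "query_string": json.dumps({
        "query": {
            "bool": {
                "must": [
                    {"term": {"status": "job-failed"}},
                    {"query_string": {"query": "\"Got non-zero exit code: 143\""}},
                ]
            }
        }
    }),
    # Action: the lightweight retry job to submit when the condition matches.
    "job_type": "hysds-io-lw-mozart-retry",
    "queue": "factotum-job_worker-small",  # assumed queue name, for illustration
    "priority": 0,
    "kwargs": "{}",
    "enabled": True,
}

print(json.dumps(retry_exit_code_143_rule, indent=2))
```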
For job retries, set a maximum retry limit (e.g. 5 or 10) to prevent infinite retries, and add +1 to the priority on each retry so that retried jobs retain their relative ordering in the system; otherwise, failed jobs end up at “the back of the line.” A sketch of this retry policy follows.
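As a minimal sketch of that policy (not the actual hysds-io-lw-mozart-retry implementation), the logic is roughly: read the job's current retry count, stop once the cap is reached, and otherwise resubmit with the count incremented and the priority bumped by one. Names such as `retry_count`, `priority`, and `resubmit_job` are illustrative assumptions.

```python
MAX_RETRIES = 5      # e.g. 5 or 10; prevents infinite retry loops
MAX_PRIORITY = 9     # assumed upper bound on job priority

def retry_failed_job(job, resubmit_job):
    """Sketch of a capped retry with a +1 priority bump.

    `job` is assumed to be a dict-like job-status record; `resubmit_job` is a
    hypothetical callable that resubmits the job payload to Mozart.
    """
    retry_count = job.get("retry_count", 0)
    if retry_count >= MAX_RETRIES:
        # Give up: leave the job in its failed state for manual triage.
        return False

    job["retry_count"] = retry_count + 1
    # Bump priority so the retried job keeps its relative place in line
    # instead of falling to the back of the queue.
    job["priority"] = min(job.get("priority", 0) + 1, MAX_PRIORITY)
    resubmit_job(job)
    return True
```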
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Experts: @Marjorie Lucas, @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Marjorie Lucas