Confidence Level: Moderate. This article includes input from several JPLers. Multiple subject matter experts can indicate that a page is more frequently reviewed and updated.

Attached is a reference JSON file of generic Figaro trigger rules (mainly for AWS) that handle various failed-job scenarios for robustness. It employs the rules below.

Generic Trigger Rules

The following generic trigger rules in Mozart add resiliency by handling common job failures.

Name	Condition	Action	Notes
retry-failed-client_error	All job-failed containing string match “Client Error”	hysds-io-lw-mozart-retry
retry-failed-SoftTimeLimitExceeded-exception	All job-failed due to Mozart job exception field containing SoftTimeLimitExceeded()	hysds-io-lw-mozart-retry
retry failed could not connect to endpoint url	All job-failed with any error containing “Could not connect to the endpoint URL”	hysds-io-lw-mozart-retry
retry failed to download	All job-failed with any error containing “Failed to download”	hysds-io-lw-mozart-retry
retry-failed-generic_non_zero_exit_code_1	All job-failed due to “Got non-zero exit code: 1”	hysds-io-lw-mozart-retry
retry-failed-server_error	All job-failed containing string match “Server Error”	hysds-io-lw-mozart-retry
retry-failed-SoftTimeLimitExceeded-query	All job-failed Mozart jobs matching a query string containing “SoftTimeLimitExceeded”	hysds-io-lw-mozart-retry
retry-failed-CalledProcessError	All job-failed with “CalledProcessError”	hysds-io-lw-mozart-retry
retry-failed-nonzero_exit_code_125	All job-failed containing string match “Got non-zero exit code: 125”	hysds-io-lw-mozart-retry
retry-failed-too_many_requests_for_url	All job-failed with string match “Too Many Requests for url”	hysds-io-lw-mozart-retry
retry-failed-exit_code_143	All job-failed containing string match “Got non-zero exit code: 143”	hysds-io-lw-mozart-retry	This condition may be caused by incorrect configuration and use of Docker on smaller node instance types.
retry-job-offline	All job-offline	hysds-io-lw-mozart-retry	This condition occurs in low-level infrastructure failures, such as when the network goes down. The workers time out and are marked by the system as offline. This trigger rule retries the offline job. Note that the offline worker may still be running, and if the network is reestablished, the offline job may still publish its final product results. A follow-up ticket could add a check so that jobs marked offline do not publish.
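
As a rough illustration of how the conditions in the table select jobs, the Python sketch below checks a job's status and error text against a few of the rules. The `RULES` list and the `matching_rules` function are hypothetical and are not part of HySDS. The real rules live in the attached JSON and are evaluated by Mozart, so treat this only as a way to reason about which rule would fire.

```python
# Hypothetical, simplified stand-ins for a few rows of the table above.
# Each entry: (rule name, job status to match, substring to look for or None).
RULES = [
    ("retry-failed-client_error", "job-failed", "Client Error"),
    ("retry-failed-server_error", "job-failed", "Server Error"),
    ("retry-failed-exit_code_143", "job-failed", "Got non-zero exit code: 143"),
    ("retry-job-offline", "job-offline", None),  # matches on status alone
]

# Action name taken from the table; every rule above triggers the same action.
RETRY_ACTION = "hysds-io-lw-mozart-retry"


def matching_rules(status, error_text=""):
    """Return the names of all rules whose condition matches the job."""
    matched = []
    for name, rule_status, needle in RULES:
        if status != rule_status:
            continue
        # A rule with no substring matches every job in that status.
        if needle is None or needle in error_text:
            matched.append(name)
    return matched
```

For example, `matching_rules("job-failed", "500 Server Error: Internal Server Error")` selects only the server-error rule, while a `job-offline` job matches `retry-job-offline` regardless of its error text.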

For job retries, set a maximum retry limit (e.g., 5 or 10) to prevent infinite retries, and add +1 to the job priority so that retried jobs retain their relative ordering in the system. Otherwise, failed jobs will be at “the back of the line”.
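
The retry policy described above can be sketched in a few lines of Python. This is a minimal illustration, not the retry job's actual implementation: the function name `plan_retry`, its parameters, and the priority ceiling of 9 are all assumptions made for the example.

```python
def plan_retry(retry_count, priority, max_retries=5, max_priority=9):
    """Decide whether a failed job should be retried.

    Returns (new_retry_count, new_priority), or None once the retry
    limit has been reached. max_priority=9 is an assumed ceiling.
    """
    if retry_count >= max_retries:
        # Limit reached: stop here to prevent infinite retries.
        return None
    # Bump priority by one so the retried job keeps its place relative
    # to other jobs instead of going to the back of the line.
    return retry_count + 1, min(priority + 1, max_priority)
```

With the defaults, `plan_retry(0, 5)` yields `(1, 6)`, and `plan_retry(5, 5)` yields `None` because the limit of five retries has been reached.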
