Generic Trigger Rules for Mozart failed jobs
Confidence Level: Moderate. This article includes input from several JPLers; pages with multiple subject matter experts tend to be reviewed and updated more frequently.
Attached is a reference set of generic Figaro trigger rules in JSON (mainly for AWS) that adds robustness by handling various failed-job scenarios. It employs the rules listed below; a hedged sketch of what a single rule entry might look like follows the table.
Generic Trigger Rules
The following generic trigger rules add resiliency in Mozart by handling common failed-job scenarios.
Name | Condition | Action | Notes |
---|---|---|---|
retry-failed-client_error | All job-failed with string match “Client Error” | hysds-io-lw-mozart-retry | |
retry-failed-SoftTimeLimitExceeded-exception | All job-failed whose Mozart job exception field contains SoftTimeLimitExceeded() | hysds-io-lw-mozart-retry | |
retry failed could not connect to endpoint url | All job-failed with any error containing “Could not connect to the endpoint URL” | hysds-io-lw-mozart-retry | |
retry failed to download | All job-failed with any error containing “Failed to download” | hysds-io-lw-mozart-retry | |
retry-failed-generic_non_zero_exit_code_1 | All job-failed with string match “Got non-zero exit code: 1” | hysds-io-lw-mozart-retry | |
retry-failed-server_error | All job-failed with string match “Server Error” | hysds-io-lw-mozart-retry | |
retry-failed-SoftTimeLimitExceeded-query | All job-failed with query string match containing “SoftTimeLimitExceeded” | hysds-io-lw-mozart-retry | |
retry-failed-CalledProcessError | All job-failed with string match “CalledProcessError” | hysds-io-lw-mozart-retry | |
retry-failed-nonzero_exit_code_125 | All job-failed with string match “Got non-zero exit code: 125” | hysds-io-lw-mozart-retry | |
retry-failed-too_many_requests_for_url | All job-failed with string match “Too Many Requests for url” | hysds-io-lw-mozart-retry | |
retry-failed-exit_code_143 | All job-failed with string match “Got non-zero exit code: 143” | hysds-io-lw-mozart-retry | This condition may be caused by incorrect configuration and use of Docker on smaller node instance types. |
retry-job-offline | All job-offline | hysds-io-lw-mozart-retry | This condition occurs during low-level infrastructure failures, such as when the network goes down; workers time out and are marked offline by the system. This trigger rule retries the offline job. |
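For reference, a single entry in such a trigger rules JSON might look roughly like the sketch below, rendered here as a Python dictionary. This is not the authoritative Mozart/Figaro schema: the field names (`rule_name`, `query_string`, `job_type`, `queue`, `priority`, `kwargs`, `enabled`), the job-status field values, and the queue name are assumptions for illustration; verify them against the attached JSON and your own Mozart deployment.

```python
import json

# Illustrative sketch only: the field names below are assumptions, not the
# authoritative Mozart/Figaro user-rule schema. Check the attached JSON and
# your Mozart deployment for the exact structure.
retry_exit_code_143_rule = {
    "rule_name": "retry-failed-exit_code_143",
    # Condition: an Elasticsearch-style query over job-status documents that
    # matches failed jobs whose error text contains the given phrase.
    "query_string": json.dumps({
        "query": {
            "bool": {
                "must": [
                    {"term": {"status": "job-failed"}},
                    {"query_string": {"query": "\"Got non-zero exit code: 143\""}},
                ]
            }
        }
    }),
    # Action: the lightweight retry job to submit when the condition matches.
    "job_type": "hysds-io-lw-mozart-retry",
    "queue": "factotum-job_worker-small",  # assumed queue name, for illustration
    "priority": 0,
    "kwargs": "{}",
    "enabled": True,
}

print(json.dumps(retry_exit_code_143_rule, indent=2))
```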
For job retries, set a maximum retry limit (e.g. 5 or 10) to prevent infinite retries, and add +1 to the priority on each retry so that retried jobs retain their relative ordering in the system; otherwise, failed jobs end up at “the back of the line.” A sketch of this retry policy follows.
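As a minimal sketch of that policy (not the actual hysds-io-lw-mozart-retry implementation), the logic is roughly: read the job's current retry count, stop once the cap is reached, and otherwise resubmit with the count incremented and the priority bumped by one. Names such as `retry_count`, `priority`, and `resubmit_job` are illustrative assumptions.

```python
MAX_RETRIES = 5      # e.g. 5 or 10; prevents infinite retry loops
MAX_PRIORITY = 9     # assumed upper bound on job priority

def retry_failed_job(job, resubmit_job):
    """Sketch of a capped retry with a +1 priority bump.

    `job` is assumed to be a dict-like job-status record; `resubmit_job` is a
    hypothetical callable that resubmits the job payload to Mozart.
    """
    retry_count = job.get("retry_count", 0)
    if retry_count >= MAX_RETRIES:
        # Give up: leave the job in its failed state for manual triage.
        return False

    job["retry_count"] = retry_count + 1
    # Bump priority so the retried job keeps its relative place in line
    # instead of falling to the back of the queue.
    job["priority"] = min(job.get("priority", 0) + 1, MAX_PRIORITY)
    resubmit_job(job)
    return True
```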
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Experts: @Marjorie Lucas, @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Marjorie Lucas