Generic Trigger Rules for Mozart failed jobs


Confidence Level Moderate  This article includes input from several JPLers. Multiple subject matter experts can indicate that a page is more frequently reviewed and updated.


Attached is a reference JSON file of generic Figaro trigger rules (mainly for AWS) that handle various failed-job scenarios to improve robustness. It employs the rules below.

Generic Trigger Rules

The following generic trigger rules in Mozart add resiliency by automatically retrying jobs that fail in common ways.

| Name | Condition | Action | Notes |
| --- | --- | --- | --- |
| retry-failed-client_error | All job-failed containing string match “Client Error” | hysds-io-lw-mozart-retry | |
| retry-failed-SoftTimeLimitExceeded-exception | All job-failed where the Mozart job exception field contains SoftTimeLimitExceeded() | hysds-io-lw-mozart-retry | |
| retry failed could not connect to endpoint url | All job-failed with any error containing “Could not connect to the endpoint URL” | hysds-io-lw-mozart-retry | |
| retry failed to download | All job-failed with any error containing “Failed to download” | hysds-io-lw-mozart-retry | |
| retry-failed-generic_non_zero_exit_code_1 | All job-failed due to “Got non-zero exit code: 1” | hysds-io-lw-mozart-retry | |
| retry-failed-server_error | All job-failed containing string match “Server Error” | hysds-io-lw-mozart-retry | |
| retry-failed-SoftTimeLimitExceeded-query | All job-failed where a query string match on the Mozart job contains “SoftTimeLimitExceeded” | hysds-io-lw-mozart-retry | |
| retry-failed-CalledProcessError | All job-failed with “CalledProcessError” | hysds-io-lw-mozart-retry | |
| retry-failed-nonzero_exit_code_125 | All job-failed containing string match “Got non-zero exit code: 125” | hysds-io-lw-mozart-retry | |
| retry-failed-too_many_requests_for_url | All job-failed with string match “Too Many Requests for url” | hysds-io-lw-mozart-retry | |
| retry-failed-exit_code_143 | All job-failed containing string match “Got non-zero exit code: 143” | hysds-io-lw-mozart-retry | This condition may be caused by incorrect configuration and use of Docker on smaller node instance types. |
| retry-job-offline | All job-offline | hysds-io-lw-mozart-retry | This condition occurs in low-level infrastructure failures, such as when the network goes down. The workers time out and are marked by the system as offline. This trigger rule retries the offline job. |
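The rule conditions listed above mostly reduce to "job status plus a string found in the error text". The following is a minimal sketch of that idea in plain Python, not the Mozart or Figaro implementation: the rule names and match strings are taken from the list above, while the function `matching_rules` and the job-record keys `status`, `error`, and `traceback` are hypothetical names chosen for illustration. It also simplifies the two SoftTimeLimitExceeded rules, which in the list above look at different parts of the job record.

```python
# Rule names and match strings are copied from the rules listed above.
# The job-record layout (status/error/traceback keys) is an assumption
# made for this sketch, not the actual Mozart job schema.
RETRY_RULES = [
    ("retry-failed-client_error", "Client Error"),
    ("retry-failed-SoftTimeLimitExceeded-exception", "SoftTimeLimitExceeded()"),
    ("retry failed could not connect to endpoint url",
     "Could not connect to the endpoint URL"),
    ("retry failed to download", "Failed to download"),
    ("retry-failed-generic_non_zero_exit_code_1", "Got non-zero exit code: 1"),
    ("retry-failed-server_error", "Server Error"),
    ("retry-failed-SoftTimeLimitExceeded-query", "SoftTimeLimitExceeded"),
    ("retry-failed-CalledProcessError", "CalledProcessError"),
    ("retry-failed-nonzero_exit_code_125", "Got non-zero exit code: 125"),
    ("retry-failed-too_many_requests_for_url", "Too Many Requests for url"),
    ("retry-failed-exit_code_143", "Got non-zero exit code: 143"),
]


def matching_rules(job):
    """Return the names of all rules whose condition matches the job record."""
    status = job.get("status")
    if status == "job-offline":
        # The offline rule matches on status alone.
        return ["retry-job-offline"]
    if status != "job-failed":
        return []
    # Search the (hypothetical) error and traceback fields as one string.
    text = " ".join(str(job.get(key, "")) for key in ("error", "traceback"))
    return [name for name, needle in RETRY_RULES if needle in text]


if __name__ == "__main__":
    failed = {"status": "job-failed", "error": "Got non-zero exit code: 143"}
    print(matching_rules(failed))
```

With plain substring matching, an error such as “Got non-zero exit code: 143” also satisfies the “Got non-zero exit code: 1” pattern, so the sketch reports both rules. Since every rule here maps to the same action (hysds-io-lw-mozart-retry), that overlap only matters if it causes the same job to be retried more than once.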

  • Note that a worker marked offline may still be running. If the network is re-established, the offline job may still publish its final product results. A follow-up ticket could add a check so that a job marked offline does not publish.

  • For job retries, set a maximum limit (e.g. 5 or 10) to prevent infinite retries, and add +1 to the priority so that retried jobs keep their relative ordering among the jobs still to complete. Otherwise, failed jobs will be at “the back of the line”.
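The retry guidance above (cap the number of retries and bump the priority by one) can be sketched as follows. This is an illustration under stated assumptions, not how hysds-io-lw-mozart-retry is implemented: the function `plan_retry`, the job fields `retry_count` and `priority`, and the priority ceiling of 9 are all hypothetical choices for this example.

```python
MAX_RETRIES = 5  # assumed cap; the guidance above suggests e.g. 5 or 10


def plan_retry(job, max_retries=MAX_RETRIES, max_priority=9):
    """Return an updated copy of the job to resubmit, or None once the cap is hit.

    The field names (retry_count, priority) and the priority ceiling are
    assumptions for this sketch, not the actual HySDS job schema.
    """
    retries = job.get("retry_count", 0)
    if retries >= max_retries:
        # Stop here to avoid retrying a persistently failing job forever.
        return None
    retried = dict(job)  # shallow copy so the original record is untouched
    retried["retry_count"] = retries + 1
    # Bump priority by one so the retried job is not sent to the back of the line.
    retried["priority"] = min(job.get("priority", 0) + 1, max_priority)
    return retried


if __name__ == "__main__":
    print(plan_retry({"priority": 3}))
    print(plan_retry({"priority": 3, "retry_count": 5}))
```

Clamping the priority keeps repeated retries from exceeding whatever ceiling the queueing system allows; the value 9 is only a placeholder for that ceiling.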



Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise

Contribution History:

Subject Matter Expert:

@Marjorie Lucas

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Marjorie Lucas

 
