
Confidence Level: Moderate. This article includes input from several JPLers. Multiple subject matter experts can indicate that a page is more frequently reviewed and updated.

Attached is a generic reference Figaro trigger rules JSON file (mainly for AWS) that improves robustness by handling various failed-job scenarios. It employs the rules below.

Generic Trigger Rules

The following are generic trigger rules in Mozart for adding more resiliency by handling common failed jobs.

| Name | Condition | Action | Notes |
|---|---|---|---|
| retry-failed-client_error | All job-failed containing string match “Client Error” | hysds-io-lw-mozart-retry | |
| retry-failed-SoftTimeLimitExceeded-exception | All job-failed due to Mozart job exception field containing SoftTimeLimitExceeded() | hysds-io-lw-mozart-retry | |
| retry failed could not connect to endpoint url | All job-failed with any error containing “Could not connect to the endpoint URL” | hysds-io-lw-mozart-retry | |
| retry failed to download | All job-failed with any error containing “Failed to download” | hysds-io-lw-mozart-retry | |
| retry-failed-generic_non_zero_exit_code_1 | All job-failed due to “Got non-zero exit code: 1” | hysds-io-lw-mozart-retry | |
| retry-failed-server_error | All job-failed containing string match “Server Error” | hysds-io-lw-mozart-retry | |
| retry-failed-SoftTimeLimitExceeded-query | All job-failed due to Mozart job with query string match containing “SoftTimeLimitExceeded” | hysds-io-lw-mozart-retry | |
| retry-failed-CalledProcessError | All job-failed with “CalledProcessError” | hysds-io-lw-mozart-retry | |
| retry-failed-nonzero_exit_code_125 | All job-failed containing string match “Got non-zero exit code: 125” | hysds-io-lw-mozart-retry | |
| retry-failed-too_many_requests_for_url | All job-failed with string match “Too Many Requests for url” | hysds-io-lw-mozart-retry | |
| retry-failed-exit_code_143 | All job-failed containing string match “Got non-zero exit code: 143” | hysds-io-lw-mozart-retry | This condition may be caused by incorrect configuration and use of Docker on smaller node instance types. |
| retry-job-offline | All job-offline | hysds-io-lw-mozart-retry | This condition occurs in low-level infrastructure failures, such as when the network goes down. The workers time out and are marked by the system as offline. This trigger rule retries the offline job. |
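
The conditions in the rules above amount to a status check plus a string match on the job's error text. The Python sketch below illustrates that logic only; it is not how Mozart evaluates trigger rules, and the `Rule` tuple, the `RULES` list, and the `matching_rules` function are hypothetical names introduced for illustration. It assumes regex matching with a trailing word boundary on the exit-code patterns, since a plain substring match on “Got non-zero exit code: 1” would also match exit codes 125 and 143. It also does not distinguish between the exception field and the query string, so both SoftTimeLimitExceeded rules can match the same text.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical in-memory form of one trigger rule (not the Mozart schema)."""
    name: str
    status: str
    pattern: Optional[str]  # regex; None means "any job with this status"


# Rule names and match strings are taken from the rules listed above.
RULES = [
    Rule("retry-failed-client_error", "job-failed", r"Client Error"),
    Rule("retry-failed-SoftTimeLimitExceeded-exception", "job-failed",
         r"SoftTimeLimitExceeded\(\)"),
    Rule("retry failed could not connect to endpoint url", "job-failed",
         r"Could not connect to the endpoint URL"),
    Rule("retry failed to download", "job-failed", r"Failed to download"),
    # \b keeps "exit code: 1" from also matching 125 or 143.
    Rule("retry-failed-generic_non_zero_exit_code_1", "job-failed",
         r"Got non-zero exit code: 1\b"),
    Rule("retry-failed-server_error", "job-failed", r"Server Error"),
    Rule("retry-failed-SoftTimeLimitExceeded-query", "job-failed",
         r"SoftTimeLimitExceeded"),
    Rule("retry-failed-CalledProcessError", "job-failed", r"CalledProcessError"),
    Rule("retry-failed-nonzero_exit_code_125", "job-failed",
         r"Got non-zero exit code: 125\b"),
    Rule("retry-failed-too_many_requests_for_url", "job-failed",
         r"Too Many Requests for url"),
    Rule("retry-failed-exit_code_143", "job-failed",
         r"Got non-zero exit code: 143\b"),
    Rule("retry-job-offline", "job-offline", None),
]


def matching_rules(status: str, error_text: str = "") -> List[str]:
    """Return the names of all rules whose status and pattern match the job."""
    return [
        rule.name
        for rule in RULES
        if rule.status == status
        and (rule.pattern is None or re.search(rule.pattern, error_text))
    ]
```

Calling `matching_rules("job-failed", "Got non-zero exit code: 125")` returns only the exit-code-125 rule, while `matching_rules("job-offline")` returns `retry-job-offline` regardless of error text.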

  • Note that the offline worker may still be running while it is marked offline. If the network is reestablished, the offline job may still publish its final product results. A new ticket to check for offline status and skip publishing may be possible.

  • For job retries, set a maximum retry limit (e.g. 5 or 10) to prevent infinite retries, and add +1 to the job priority so that retried jobs retain their relative ordering in the system. Otherwise, failed jobs will be at “the back of the line”.
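
The retry guidance above (cap the number of retries, bump priority by one) can be sketched in Python as follows. This is an illustration under stated assumptions, not the behavior of hysds-io-lw-mozart-retry: the `Job` dataclass, its field names, and `plan_retry` are hypothetical, and the priority ceiling of 9 is an assumption about the queue's priority range.

```python
from dataclasses import dataclass, replace
from typing import Optional

MAX_RETRIES = 5   # the guidance above suggests e.g. 5 or 10
MAX_PRIORITY = 9  # assumption: priorities are capped at 9


@dataclass(frozen=True)
class Job:
    """Hypothetical minimal job record, for illustration only."""
    job_id: str
    priority: int
    retry_count: int = 0


def plan_retry(job: Job, max_retries: int = MAX_RETRIES) -> Optional[Job]:
    """Return the job to resubmit, or None once the retry limit is reached."""
    if job.retry_count >= max_retries:
        return None  # stop here to prevent infinite retries
    return replace(
        job,
        # +1 priority keeps the retried job from going to the back of the line.
        priority=min(job.priority + 1, MAX_PRIORITY),
        retry_count=job.retry_count + 1,
    )
```

For example, `plan_retry(Job("j1", priority=3))` yields a job with priority 4 and a retry count of 1, and a job that has already been retried five times yields `None`.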


Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.

JPLers can also ask HySDS questions at Stack Overflow Enterprise.
