Operators can find failed jobs in the Resource Management interface (Figaro) by filtering on job status and error type. The matching jobs can then be processed in bulk using any of the lightweight job management functions found in the On-Demand window.
Instructions
TODO: Update step 1 to click on the red failed-job tile along the top bar (it’s more reliable and user-friendly), per Andrew.
1. Inside Figaro, first select the “job-failed” facet (see #1 in image) under the left-hand menu column labeled “status”. This will narrow the scope of jobs displayed within Figaro to only those with that status.
Note: the left-hand menu column updates dynamically according to the chosen facets. If an option is not visible, confirm that no undesired facets are selected by mistake.
2. Next, select the type of error. In this example, the error “SoftTimeLimitExceeded()” (#2) is selected as a facet.
3. Each facet appears across the top of the updated job results in a blue box (#3). After the desired job-failed facet is selected, the total number of matching jobs (#4) can be seen under the facet tags. Next, click the “On-Demand” button (#5) to select the desired lightweight job action to perform.
4. In the On-Demand window, add a unique user-defined tag (#6) and select the lightweight job to perform from the drop-down menu (#7).
5. (TODO: confirm accuracy and wording of the following text.) When selecting the job from the Action drop-down menu, it is recommended to use the latest release of the desired job type. The release date is noted in square brackets following the job type name. In this example (#8), the “Purge jobs” action is chosen, and the latest release date is [release-20180529].
6. Select “Process Now” to complete the selected task on the targeted failed jobs.
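The logic behind steps 1, 2, and 5 can be sketched in a few lines of Python. This is only an illustration of how facets narrow a job list and how the newest release tag can be picked out of the Action entries; it does not call Figaro. The record fields, the job IDs, the “WorkerLostError()” error, the “Retry jobs” action, and the older release date are all hypothetical values made up for the example.

```python
import re
from datetime import datetime

# Hypothetical job records; the field names are assumptions, not Figaro's schema.
jobs = [
    {"id": "job-1", "status": "job-failed", "error": "SoftTimeLimitExceeded()"},
    {"id": "job-2", "status": "job-completed", "error": None},
    {"id": "job-3", "status": "job-failed", "error": "WorkerLostError()"},
    {"id": "job-4", "status": "job-failed", "error": "SoftTimeLimitExceeded()"},
]

# Hypothetical Action drop-down entries, using the bracketed release-date format
# described in step 5.
actions = [
    "Purge jobs [release-20171113]",
    "Purge jobs [release-20180529]",
    "Retry jobs [release-20180529]",
]

RELEASE_TAG = re.compile(r"\[release-(\d{8})\]")


def apply_facets(records, **facets):
    """Keep only the records that match every selected facet (steps 1 and 2)."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def latest_release(entries, job_type):
    """Return the entry for job_type with the most recent release date (step 5)."""
    dated = []
    for name in entries:
        match = RELEASE_TAG.search(name)
        if name.startswith(job_type) and match:
            released = datetime.strptime(match.group(1), "%Y%m%d").date()
            dated.append((released, name))
    if not dated:
        return None
    # Tuples compare by date first, so max() yields the newest release.
    return max(dated)[1]


matching = apply_facets(jobs, status="job-failed", error="SoftTimeLimitExceeded()")
chosen = latest_release(actions, "Purge jobs")
print(len(matching), "matching jobs; action:", chosen)
```

Each additional facet only narrows the result further, which is why an unwanted facet left selected can hide the option you are looking for.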