Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Premise

Some sling jobs get terminated while uploading data into S3 (for a variety of reasons in AWS, such as spot terminations and AZ load-rebalancing terminations), so the job may never get a chance to complete the transaction. This leaves "orphaned" objects in the S3 bucket.

Assumptions

All HySDS "datasets" in S3 must be backed by an entry in GRQ/ES, and each "dataset" must be transactionally complete. The Osaka API writes "*.osaka.locked.json" lock files while a dataset is being operated on. Thus the existence of a lock file indicates that the dataset is either currently being written or potentially orphaned. If the lock file is, say, more than one day old, it most likely came from an incomplete transactional write, since most dataset uploads to S3 finish within a few minutes.
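The stale-lock heuristic above can be sketched as follows. This is a minimal illustration, not the actual scrubber implementation: the bucket name and one-day threshold are assumptions, and only the age/suffix check is exercised; the S3 listing uses the standard boto3 `list_objects_v2` paginator.

```python
# Sketch: find Osaka lock files older than a threshold in an S3 bucket.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumption: locks older than 1 day are suspect


def is_stale_lock(key, last_modified, now, max_age=MAX_AGE):
    """True if key is an Osaka lock file older than max_age."""
    return key.endswith(".osaka.locked.json") and (now - last_modified) > max_age


def find_stale_locks(bucket, max_age=MAX_AGE):
    """Yield keys of stale *.osaka.locked.json objects in the bucket."""
    import boto3  # deferred import so the pure helper above is testable without AWS

    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if is_stale_lock(obj["Key"], obj["LastModified"], now, max_age):
                yield obj["Key"]
```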

Orphaned Dataset Scrubber Job

Runs daily to find and clean up old orphaned datasets. The job crawls the S3 bucket for orphaned "datasets" and deletes them from S3. It is triggered by a daily cron submitter.
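Combining the assumptions above, the per-dataset scrub decision amounts to: no GRQ/ES entry for a stale-locked dataset means it is orphaned, and it is purged only when auto-purge is enabled. The sketch below is hypothetical; `es_has_dataset` and `purge_prefix` stand in for a GRQ/ES lookup and an S3 prefix delete, neither of which is a real HySDS API name.

```python
# Sketch of the scrub decision for one candidate dataset found via a stale lock.
# es_has_dataset and purge_prefix are hypothetical callables injected by the caller.
def scrub_candidate(dataset_id, es_has_dataset, purge_prefix, auto_purge=False):
    """Return a report row recording whether the dataset was orphaned/purged."""
    orphaned = not es_has_dataset(dataset_id)  # no GRQ/ES entry => orphaned
    purged = False
    if orphaned and auto_purge:
        purge_prefix(dataset_id)  # delete the dataset's objects from S3
        purged = True
    return {"dataset_id": dataset_id, "orphaned": orphaned, "purged": purged}
```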

The output of the run is also published to the venue's GRQ/ES instance and shows up on tosca, for example: https://c-datasets.aria.hysds.io/tosca/.

On Tosca:

Code Block
type=result
...
dataset=orphaned_datasets_report

This implies updating datasets.json (plural) to support handling of the new dataset=orphaned_datasets_report dataset type.
The report should contain a log (plain-text log or CSV) of the orphaned datasets found and whether each was cleaned up in S3 (if the auto-purge flag option was enabled).
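A minimal sketch of writing such a report row log as CSV, assuming the illustrative `dataset_id`/`orphaned`/`purged` column names (the actual report schema is not fixed by this article):

```python
# Sketch: write the scrubber's findings to a CSV log, one row per dataset.
import csv


def write_report(rows, path):
    """Write report rows (dicts) to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["dataset_id", "orphaned", "purged"])
        writer.writeheader()
        writer.writerows(rows)
```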

For dataset ID tokens:

Code Block
orphaned_datasets_report-{timestamp}-{dataset_type}


Related to: "Orphaned S3 bucket cleanup PGE job" https://github.jpl.nasa.gov/hysds-org/general/issues/589 (previously internal ticket)

Public github repo: https://github.com/hysds/orphaned_datasets

Example datasets.json entry:

Code Block
{
  "ipath": "ariamh::data/Orphaned_Datasets_Report",
  "match_pattern": "/(?P<id>orphaned_datasets_report-(?P<year>\\d{4})(?P<month>\\d{2})(?P<day>\\d{2})(?P<time>\\d{6})-.+)$",
  "alt_match_pattern": null,
  "extractor": null,
  "level": "l1",
  "type": "result",
  "publish": {
    "s3-profile-name": "default",
    "location": "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}",
    "urls": [
      "http://{{ DATASET_BUCKET }}.{{ DATASET_S3_WEBSITE_ENDPOINT }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}",
      "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}"
    ]
  },
  "browse": {
    "location": "davs://{{ DAV_USER }}:{{ DAV_PASSWORD }}@{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}",
    "urls": [
      "https://{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}"
    ]
  }
}
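As a sanity check, the match_pattern above can be exercised against a sample ID built from the orphaned_datasets_report-{timestamp}-{dataset_type} token scheme. The sample timestamp and dataset type below are made up for illustration:

```python
# Sketch: verify the datasets.json match_pattern against a sample dataset ID.
import re

MATCH_PATTERN = (
    r"/(?P<id>orphaned_datasets_report-"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<time>\d{6})-.+)$"
)

# Hypothetical ID: timestamp 2024-01-02T12:34:56, dataset_type "incomplete_upload"
sample = "/datasets/orphaned_datasets_report-20240102123456-incomplete_upload"
m = re.search(MATCH_PATTERN, sample)
```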



(lightbulb) Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community


JPLers can also ask HySDS questions at Stack Overflow Enterprise
