Orphaned Dataset Scrubber


Confidence Level TBD  This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Premise

Sling jobs are sometimes terminated while uploading data to S3, for a variety of reasons in AWS such as spot instance terminations and AZ load-rebalancing terminations. A job killed mid-upload never gets the chance to complete the transaction, which leaves "orphaned" objects behind in the S3 bucket.

Assumptions

Every HySDS "dataset" in S3 must be backed by an entry in GRQ/ES, and each "dataset" must be transactionally complete. The Osaka API writes a "*.osaka.locked.json" file alongside a dataset while it is being operated on, so the presence of a lock file means the dataset is either actively being written or potentially orphaned. If a lock file is, say, more than one day old, it almost certainly came from an incomplete transactional write, since most dataset uploads to S3 finish within a few minutes.
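As a rough illustration of this heuristic, the age check could look something like the following minimal sketch (the one-day threshold and the use of the S3 object's LastModified timestamp are assumptions, not the scrubber's exact logic):

from datetime import datetime, timedelta, timezone

# Assumed threshold: a lock file older than this is treated as orphaned.
LOCK_MAX_AGE = timedelta(days=1)

def is_probably_orphaned(lock_last_modified: datetime) -> bool:
    """True if an *.osaka.locked.json object is old enough that the write
    that created it has almost certainly been abandoned."""
    age = datetime.now(timezone.utc) - lock_last_modified
    return age > LOCK_MAX_AGE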

Orphaned Dataset Scrubber Job

The scrubber job runs daily to find and clean up stale orphaned datasets. The job is expected to crawl S3 for orphaned "datasets" in the bucket and delete them from S3; it is triggered by a daily cron submitter.
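A minimal sketch of that crawl, assuming boto3 and hypothetical bucket/prefix names (the actual implementation lives in the hysds/orphaned_datasets repo linked below):

import boto3

# Hypothetical names; substitute the venue's dataset bucket and prefix.
BUCKET = "my-dataset-bucket"
PREFIX = "datasets/"

def find_lock_files(bucket=BUCKET, prefix=PREFIX):
    """Yield (key, last_modified) for every *.osaka.locked.json under prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".osaka.locked.json"):
                yield obj["Key"], obj["LastModified"]

Each lock file found this way marks a candidate dataset prefix; per the assumptions above, the scrubber would then confirm the dataset has no GRQ/ES entry and that the lock is stale before purging.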

The output of each run is also published to the venue's GRQ/ES instance and shows up on Tosca, for example: https://c-datasets.aria.hysds.io/tosca/.

On Tosca:

type=result dataset=orphaned_datasets_report

This implies updating datasets.json (plural) to support handling of the new dataset=orphaned_datasets_report type.
The report should contain a log (plain log or CSV) of the orphaned datasets found and whether each one was cleaned up in S3 (when the auto-purge flag option is enabled).
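For illustration only, a CSV report might carry columns along these lines (the column names and values are hypothetical, not the scrubber's actual schema):

dataset_key,lock_file_age_days,in_grq_es,purged
datasets/dummy/v1.0/2023/04/14/dummy-20230414T000000,3.2,false,true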

For dataset ID tokens:

orphaned_datasets_report-{timestamp}-{dataset_type}
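For example, a report generated on 2023-04-15 at 12:00:00 UTC would be named something like orphaned_datasets_report-20230415120000-<dataset_type>; the timestamp layout follows the match_pattern in the datasets.json entry below, and the dataset_type suffix here is illustrative only.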



Related to: "Orphaned S3 bucket cleanup PGE job" https://github.jpl.nasa.gov/hysds-org/general/issues/589 (previously internal ticket)

Public GitHub repo: https://github.com/hysds/orphaned_datasets
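The datasets.json entry below defines the orphaned_datasets_report dataset type referenced above: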



{ "ipath": "ariamh::data/Orphaned_Datasets_Report", "match_pattern": "/(?P<id>orphaned_datasets_report-(?P<year>\\d{4})(?P<month>\\d{2})(?P<day>\\d{2})(?P<time>\\d{6})-.+)$", "alt_match_pattern": null, "extractor": null, "level": "l1", "type": "result", "publish": { "s3-profile-name": "default", "location": "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}", "urls": [ "http://{{ DATASET_BUCKET }}.{{ DATASET_S3_WEBSITE_ENDPOINT }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}", "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}" ] }, "browse": { "location": "davs://{{ DAV_USER }}:{{ DAV_PASSWORD }}@{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}", "urls": [ "https://{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}" ] } }




Related Articles:

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.

JPLers can also ask HySDS questions at Stack Overflow Enterprise.

Page Information:


Subject Matter Expert:

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Hook Hua
