Confidence Level TBD: This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.

Premise

Some sling jobs get terminated while uploading data into S3 (for a variety of reasons in AWS, such as spot terminations and AZ load-rebalancing terminations), so the job may never get a chance to complete the transaction. This leaves "orphaned" objects in the S3 bucket.

Assumptions

All HySDS "datasets" in S3 must be backed by an entry in GRQ/ES, and each "dataset" must be transactionally complete. The Osaka API writes "*.osaka.locked.json" lock files while a dataset is being operated on. Thus the existence of a lock file indicates that the dataset is either currently being written or potentially orphaned. If the lock file is, say, more than one day old, it most likely came from an incomplete transactional write, since most dataset uploads to S3 finish within a few minutes.
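The stale-lock heuristic above can be sketched as follows. This is a minimal illustration, not the actual scrubber implementation: the bucket name and one-day threshold are assumptions, and only the age/suffix check is exercised; the S3 listing uses the standard boto3 `list_objects_v2` paginator.

```python
# Sketch: find Osaka lock files older than a threshold in an S3 bucket.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumption: locks older than 1 day are suspect


def is_stale_lock(key, last_modified, now, max_age=MAX_AGE):
    """True if key is an Osaka lock file older than max_age."""
    return key.endswith(".osaka.locked.json") and (now - last_modified) > max_age


def find_stale_locks(bucket, max_age=MAX_AGE):
    """Yield keys of stale *.osaka.locked.json objects in the bucket."""
    import boto3  # deferred import so the pure helper above is testable without AWS

    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if is_stale_lock(obj["Key"], obj["LastModified"], now, max_age):
                yield obj["Key"]
```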

Orphaned Dataset Scrubber Job

Runs daily to find and clean up old orphaned datasets. The job crawls the S3 bucket for orphaned "datasets" and deletes them from S3. It is triggered by a daily cron submitter.
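Combining the assumptions above, the per-dataset scrub decision amounts to: no GRQ/ES entry for a stale-locked dataset means it is orphaned, and it is purged only when auto-purge is enabled. The sketch below is hypothetical; `es_has_dataset` and `purge_prefix` stand in for a GRQ/ES lookup and an S3 prefix delete, neither of which is a real HySDS API name.

```python
# Sketch of the scrub decision for one candidate dataset found via a stale lock.
# es_has_dataset and purge_prefix are hypothetical callables injected by the caller.
def scrub_candidate(dataset_id, es_has_dataset, purge_prefix, auto_purge=False):
    """Return a report row recording whether the dataset was orphaned/purged."""
    orphaned = not es_has_dataset(dataset_id)  # no GRQ/ES entry => orphaned
    purged = False
    if orphaned and auto_purge:
        purge_prefix(dataset_id)  # delete the dataset's objects from S3
        purged = True
    return {"dataset_id": dataset_id, "orphaned": orphaned, "purged": purged}
```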

The output of the run is also published to the venue's GRQ/ES instance and shows up on tosca, for example: https://c-datasets.aria.hysds.io/tosca/.

On Tosca:

Code Block
type=result
...
dataset=orphaned_datasets_report

This implies updating datasets.json (plural) to support handling of the new dataset=orphaned_datasets_report dataset type.
The report should contain a log (plain-text log or CSV) of the orphaned datasets found and whether each was cleaned up in S3 (if the auto-purge flag option was enabled).
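A minimal sketch of writing such a report row log as CSV, assuming the illustrative `dataset_id`/`orphaned`/`purged` column names (the actual report schema is not fixed by this article):

```python
# Sketch: write the scrubber's findings to a CSV log, one row per dataset.
import csv


def write_report(rows, path):
    """Write report rows (dicts) to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["dataset_id", "orphaned", "purged"])
        writer.writeheader()
        writer.writerows(rows)
```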

For dataset ID tokens:

Code Block
orphaned_datasets_report-{timestamp}-{dataset_type}


Related to: "Orphaned S3 bucket cleanup PGE job" https://github.jpl.nasa.gov/hysds-org/general/issues/589 (previously internal ticket)

Public github repo: https://github.com/hysds/orphaned_datasets

Example datasets.json entry:

Code Block
{
  "ipath": "ariamh::data/Orphaned_Datasets_Report",
  "match_pattern": "/(?P<id>orphaned_datasets_report-(?P<year>\\d{4})(?P<month>\\d{2})(?P<day>\\d{2})(?P<time>\\d{6})-.+)$",
  "alt_match_pattern": null,
  "extractor": null,
  "level": "l1",
  "type": "result",
  "publish": {
    "s3-profile-name": "default",
    "location": "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}",
    "urls": [
      "http://{{ DATASET_BUCKET }}.{{ DATASET_S3_WEBSITE_ENDPOINT }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}",
      "s3://{{ DATASET_S3_ENDPOINT }}:80/{{ DATASET_BUCKET }}/datasets/{type}/{version}/{year}/{month}/{day}/{id}"
    ]
  },
  "browse": {
    "location": "davs://{{ DAV_USER }}:{{ DAV_PASSWORD }}@{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}",
    "urls": [
      "https://{{ DAV_SERVER }}/browse/{type}/{version}/{year}/{month}/{day}/{id}"
    ]
  }
}
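As a sanity check, the match_pattern above can be exercised against a sample ID built from the orphaned_datasets_report-{timestamp}-{dataset_type} token scheme. The sample timestamp and dataset type below are made up for illustration:

```python
# Sketch: verify the datasets.json match_pattern against a sample dataset ID.
import re

MATCH_PATTERN = (
    r"/(?P<id>orphaned_datasets_report-"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<time>\d{6})-.+)$"
)

# Hypothetical ID: timestamp 2024-01-02T12:34:56, dataset_type "incomplete_upload"
sample = "/datasets/orphaned_datasets_report-20240102123456-incomplete_upload"
m = re.search(MATCH_PATTERN, sample)
```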



(lightbulb) Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community


JPLers can also ask HySDS questions at Stack Overflow Enterprise
