This is the version 1.0 (2019-05-17) Software Design Document (SDD) for the Hybrid Cloud Science Data System (HySDS).
Introduction
Purpose
This document provides the high-level design information needed to understand the architecture and design of the core HySDS framework. It is not intended to replace reading the code for implementation details.
Background
The primary responsibility of a science data system (SDS) is to ingest science data as input and process it into higher-level science data products as output.
...
Before the advent of the cloud paradigm, the machinery for the SDS was mostly hosted and operated on-premises at data centers owned and operated by the respective stakeholders. In this era of big data, growing requirements and cost constraints are forcing functions for moving the SDS, along with its associated data systems, from the enterprise to the cloud. HySDS, the Hybrid Cloud Science Data System, was born in 2009 to fulfill this need for the Advanced Rapid Imaging and Analysis (ARIA) project and has since been used in various other projects (OCO-2, SMAP, SWOT, NISAR, ARIA-SG, WVCC).
...
key | constraint | type | description |
---|---|---|---|
command | required | string | executable path inside container |
params | required | array | list of param objects required to run this job (see param Object below) |
imported_worker_files | optional | object | mapping of host file/directory into container (see imported_worker_files Object below) |
dependency_images | optional | array | list of dependency image objects (see dependency_image Object below) |
recommended-queues | optional | array | list of recommended queues |
disk_usage | optional | string | minimum free disk space required to run the job, specified as "\d+(GB|MB|KB)", e.g. "100GB", "20MB", "10KB"
soft_time_limit | optional | int | soft execution time limit in seconds; worker will send a catchable exception to task to allow for cleanup before being killed; effectively a sigterm by the worker to the job; one caveat when determining the soft time limit of your job type: also include time for verdi operations such as docker image loading (on first job), input localization, dataset publishing, triage, etc. |
time_limit | optional | int | hard execution time limit in seconds; worker will send an uncatchable exception to the task and will force terminate it; effectively a sigkill by the worker to the job; one caveat when determining the hard time limit of your job type: make sure it's at least 60 seconds greater than the soft_time_limit; otherwise job states will be orphaned in figaro
pre | optional | array | list of strings specifying pre-processor functions to run; behavior depends on disable_pre_builtins; more info below on Preprocessor Functions |
disable_pre_builtins | optional | boolean | if set to true, default builtin pre-processors (currently [hysds.utils.localize_urls, hysds.utils.mark_localized_datasets, hysds.utils.validate_checksum_files]) are disabled and would need to be specified in pre to run; if not specified or set to false, list of pre-processors specified by pre is appended after the default builtins |
post | optional | array | list of strings specifying post-processor functions to run; behavior depends on disable_post_builtins; more info below on Postprocessor Functions |
disable_post_builtins | optional | boolean | if set to true, default builtin post-processors (currently [hysds.utils.publish_datasets]) are disabled and would need to be specified in post to run; if not specified or set to false, list of post-processors specified by post is appended after the default builtins |
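For example, the following hypothetical job-spec fragment keeps the builtin pre-processors and appends a custom one (the module path my_package.my_module.check_inputs is a placeholder, not part of HySDS):

```json
{
  "disable_pre_builtins": false,
  "pre": [ "my_package.my_module.check_inputs" ]
}
```

Because disable_pre_builtins is false (the default), the custom function runs after hysds.utils.localize_urls, hysds.utils.mark_localized_datasets, and hysds.utils.validate_checksum_files.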
...
key | key type | value | value type |
---|---|---|---|
path to file or directory on host | string | path to file or directory in container | string |
path to file or directory on host | string | one-item list containing the path to the file or directory in container | array
path to file or directory on host | string | two-item list of the path to the file or directory in container and the mount mode: ro for read-only or rw for read-write (default is ro) | array
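As an illustration, the following hypothetical mapping (all paths are placeholders) shows the three forms: a plain string mapping, a one-item list (both mounted read-only by default), and a two-item list mounted read-write:

```json
{
  "imported_worker_files": {
    "/home/ops/.netrc": "/home/ops/.netrc",
    "/home/ops/.aws": [ "/home/ops/.aws" ],
    "/data/work/cache": [ "/cache", "rw" ]
  }
}
```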
Syntax
```
{
  "command": "string",
  "recommended-queues": [ "string" ],
  "disk_usage": "\d+(GB|MB|KB)",
  "imported_worker_files": {
    "string": "string",
    "string": [ "string" ],
    "string": [ "string", "ro" | "rw" ]
  },
  "dependency_images": [
    {
      "container_image_name": "string",
      "container_image_url": "string" | null,
      "container_mappings": {
        "string": "string",
        "string": [ "string" ],
        "string": [ "string", "ro" | "rw" ]
      }
    }
  ],
  "soft_time_limit": int,
  "time_limit": int,
  "disable_pre_builtins": true | false,
  "pre": [ "string" ],
  "disable_post_builtins": true | false,
  "post": [ "string" ],
  "params": [
    {
      "name": "string",
      "destination": "positional" | "localize" | "context"
    }
  ]
}
```
...
key | constraint | type | description |
---|---|---|---|
component | optional | string | component web interface to display this job type in (tosca or figaro); defaults to tosca |
label | optional | string | label to be used when this job type is displayed in web interfaces (tosca and figaro); otherwise it will show an automatically generated label based on the string after the "hysds.io.json." of the hysds-io file |
submission_type | optional | string | specifies if the job should be submitted once per product in query or once per job submission; iteration for a submission of the job for each result in the query or individual for a single submission; defaults to iteration |
enable_dedup | optional | boolean | set to true to enable job deduplication; false to disable; defaults to true |
action-type | optional | string | action type to expose job as; on-demand, trigger, or both; defaults to both |
allowed_accounts | optional | array | list of strings specifying account user IDs allowed to run this job type from the web interfaces (tosca and figaro); if not defined, ops account is the only account that can access this job type; if _all is specified in the list, then all users will be able to access this job type |
params | required | array | list of matching param objects from job-spec required to run this job (see params Object below) |
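For instance, a hysds-io file exposing the hypothetical hello_world job type above to all users as an on-demand job might look like the following (all values illustrative); the entries under params are described in the params Object section below:

```json
{
  "label": "Hello World",
  "component": "tosca",
  "submission_type": "individual",
  "action-type": "on-demand",
  "allowed_accounts": [ "_all" ],
  "params": [
    {
      "name": "greeting",
      "from": "submitter",
      "type": "text",
      "default": "hello"
    }
  ]
}
```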
...
key | constraint | type | description |
---|---|---|---|
name | required | string | parameter name; should match corresponding parameter name in job-spec |
from | required | string | value for a hard-coded parameter value, submitter for a user-submitted value, or dataset_jpath:<jpath> to extract the value from the ElasticSearch result using a JPath-like notation (see from Specification below)
value | required if from is set to value | string | hard-coded parameter value |
type | optional | string | possible values: text, number, datetime, date, boolean, enum, email, textarea, region, container_version, jobspec_version, hysdsio_version (see Valid Param Types below) |
default | optional | string | default value to use (must be string even if it's a number) |
optional | optional | boolean | parameter is optional and can be left blank |
placeholder | optional | string | value to use as a hint when displaying the form input |
enumerables | required if type is enum | array | list of string values to enumerate via a dropdown list in the form input |
lambda | optional | string | a Python lambda function outputting a processed/validated equivalent of the submitted value. |
version_regex | required if type is container_version, jobspec_version or hysdsio_version | string | regex used to filter on the front component of the respective container, job-spec or hysds-io ID; e.g. if type=jobspec_version, version_regex=job-time-series and the list of all job-spec IDs in the system is ["job-time-series:release-20171103", "job-hello_world:release-20171103", "job-time-series:master"], the list will be filtered down to those whose ID matches the regex job-time-series in the front component (the string before the first ":"); in this example the resulting filtered set of release tags/branches will be listed as ["release-20171103", "master"] in a dropdown box; similar to type=enum with enumerables set to all release tags for a certain job type
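To illustrate the three from sources, the following hypothetical param objects show a hard-coded value, a user-submitted value, and a value extracted from the ElasticSearch result (the names and the jpath _id are illustrative; note that default is a string even for a number type):

```json
[
  { "name": "processing_level", "from": "value", "value": "L2" },
  { "name": "threshold", "from": "submitter", "type": "number", "default": "0.5" },
  { "name": "dataset_id", "from": "dataset_jpath:_id" }
]
```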
...
value | description |
---|---|
text | a text string, will be kept as text |
number | a real number |
date | a date in ISO8601 format: YYYY-MM-DD; will be treated as a "text" field for passing into the container |
datetime | a date with time in ISO8601 format: YYYY-MM-DDTHH:mm:SS.SSS; will be treated as a "text" field |
boolean | true or false in a drop down |
enum | one of a set of options in a drop down; must specify enumerables field to specify the list of possible options; these will be "text" types in the enumerables set |
email | an e-mail address, will be treated as "text" |
textarea | same as text, but displayed larger with the textarea HTML tag |
region | auto-populated from the facet view leaflet tool |
container_version | a version of an existing container registered in the Mozart API; must define "version_regex" field |
jobspec_version | a version of an existing job-spec registered in the Mozart API; must define "version_regex" field |
hysdsio_version | a version of an existing hysds-io registered in the Mozart API; must define "version_regex" field |
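For example (all names and values hypothetical), a jobspec_version param filtered by version_regex and an enum param whose submitted value is normalized by a lambda might be declared as:

```json
[
  {
    "name": "release",
    "from": "submitter",
    "type": "jobspec_version",
    "version_regex": "job-time-series"
  },
  {
    "name": "mode",
    "from": "submitter",
    "type": "enum",
    "enumerables": [ "fast", "accurate" ],
    "lambda": "lambda x: x.lower()"
  }
]
```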
...
Note: Any other PGE data files should be placed in the <Dataset ID> directory, as the whole directory is the dataset.
HySDS dataset and metadata JSON files
dataset JSON file
A product must produce a <Dataset ID>.dataset.json file in the <Dataset ID> directory. This file contains the JSON-formatted metadata that will be cataloged for the dataset:
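A minimal, illustrative sketch of such a file (field values are placeholders, assuming the conventional HySDS fields version, label, starttime, endtime, and a GeoJSON location):

```json
{
  "version": "v1.0",
  "label": "example dataset",
  "starttime": "2017-01-01T00:00:00.000Z",
  "endtime": "2017-01-01T00:10:00.000Z",
  "location": {
    "type": "Polygon",
    "coordinates": [
      [
        [ -122.9, 40.47 ],
        [ -123.9, 40.97 ],
        [ -124.8, 40.10 ],
        [ -122.9, 40.47 ]
      ]
    ]
  }
}
```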
...
Acronym/Term | Description |
---|---|
ARIA | Advanced Rapid Imaging and Analysis |
ARIA-SG | Advanced Rapid Imaging and Analysis (Earth Observatory of Singapore) |
AWS | Amazon Web Services |
DAAC | Distributed Active Archive Center |
GCP | Google Cloud Platform |
GDS | Ground Data System |
HDDR | High Density Digital Recorder |
HTTP/HTTPS | HyperText Transfer Protocol / HyperText Transfer Protocol Secure |
HySDS | Hybrid Cloud Science Data System |
IDP | Interim Digital SAR Processor |
JPL | Jet Propulsion Laboratory |
JSON | JavaScript Object Notation |
NASA | National Aeronautics and Space Administration |
NISAR | NASA-ISRO Synthetic Aperture Radar Mission |
OCO-2 | Orbiting Carbon Observatory 2 |
PGE | Product Generation Executor/Executable |
SAR | Synthetic Aperture Radar |
SDS | Science Data System |
SMAP | Soil Moisture Active Passive Mission |
SSL | Secure Sockets Layer |
SWOT | Surface Water Ocean Topography Mission |
WVCC | Water Vapor Cloud Climatology |
XML | eXtensible Markup Language |
...