This is the version 1.0 (2019-05-17) Software Design Document (SDD) for the Hybrid Cloud Science Data System (HySDS).
Table of Contents
Introduction
Purpose
This document provides the high-level design information needed to understand the architecture and design of the core HySDS framework. It is not intended to replace reading the code for details.
Background
The primary responsibility of a science data system (SDS) is to ingest science data as input and process it into higher-level science data products as output.
...
key | constraint | type | description |
---|---|---|---|
command | required | string | executable path inside container |
params | required | array | list of param objects required to run this job (see param Object below) |
imported_worker_files | optional | object | mapping of host file/directory into container (see imported_worker_files Object below) |
dependency_images | optional | array | list of dependency image objects (see dependency_image Object below) |
recommended-queues | optional | array | list of recommended queues |
disk_usage | optional | string | minimum free disk space required to run the job, specified as "\d+(GB|MB|KB)", e.g. "100GB", "20MB", "10KB" |
soft_time_limit | optional | int | soft execution time limit in seconds; worker will send a catchable exception to task to allow for cleanup before being killed; effectively a sigterm by the worker to the job; one caveat when determining the soft time limit of your job type: also include time for verdi operations such as docker image loading (on first job), input localization, dataset publishing, triage, etc. |
time_limit | optional | int | hard execution time limit in seconds; worker will send an uncatchable exception to the task and will force terminate it; effectively a sigkill by the worker to the job; one caveat when determining the hard time limit of your job type: make sure it's at least 60 seconds greater than the soft_time_limit otherwise job states will be orphaned in figaro |
pre | optional | array | list of strings specifying pre-processor functions to run; behavior depends on disable_pre_builtins; more info below on Preprocessor Functions |
disable_pre_builtins | optional | boolean | if set to true, default builtin pre-processors (currently [hysds.utils.localize_urls, hysds.utils.mark_localized_datasets, hysds.utils.validate_checksum_files]) are disabled and would need to be specified in pre to run; if not specified or set to false, list of pre-processors specified by pre is appended after the default builtins |
post | optional | array | list of strings specifying post-processor functions to run; behavior depends on disable_post_builtins; more info below on Postprocessor Functions |
disable_post_builtins | optional | boolean | if set to true, default builtin post-processors (currently [hysds.utils.publish_datasets]) are disabled and would need to be specified in post to run; if not specified or set to false, list of post-processors specified by post is appended after the default builtins |
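A custom pre- or post-processor referenced from pre or post is simply an importable Python function. The sketch below is hypothetical: it assumes the (job, context) call signature used by the built-ins such as hysds.utils.localize_urls, and the module path and parameter name are made up for illustration; confirm the exact contract against the HySDS source.

```python
# Hypothetical custom pre-processor sketch. Assumes the (job, ctxt) call
# signature used by built-ins such as hysds.utils.localize_urls; the
# parameter name checked here is illustrative, not a HySDS convention.

def check_required_inputs(job, ctxt):
    """Fail fast if the job context is missing an expected parameter."""
    if "input_dataset" not in ctxt:  # "input_dataset" is a made-up name
        raise RuntimeError("missing required parameter: input_dataset")
    return True  # the built-in pre-processors return True on success
```

Such a function would then be listed in the job-spec, e.g. "pre": ["mypackage.preprocessors.check_required_inputs"] (module path hypothetical).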
...
key | key type | value | value type |
---|---|---|---|
path to file or directory on host | string | path to file or directory in container | string |
path to file or directory on host | string | one item list of path to file or directory in container | array |
path to file or directory on host | string | two item list of path to file or directory in container and mount mode: ro for read-only and rw for read-write (default is ro ) | array |
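To make the three value forms concrete, the following mapping (shown as a Python dict, equivalent to the JSON that would appear in a job-spec; all paths are hypothetical) mounts one file using the string form, one directory using the one-item list form, and one directory read-write using the two-item list form:

```python
# Hypothetical imported_worker_files mapping; host and container paths are illustrative.
imported_worker_files = {
    # string value: mount host path at the given container path (read-only by default)
    "/home/ops/.netrc": "/home/ops/.netrc",
    # one-item list: container path given as a list (still read-only by default)
    "/home/ops/.aws": ["/home/ops/.aws"],
    # two-item list: explicit mount mode, "ro" (default) or "rw"
    "/data/work/cache": ["/data/work/cache", "rw"],
}
```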
Syntax
```
{
  "command": "string",
  "recommended-queues": [ "string" ],
  "disk_usage": "\d+(GB|MB|KB)",
  "imported_worker_files": {
    "string": "string",
    "string": [ "string" ],
    "string": [ "string", "ro" | "rw" ]
  },
  "dependency_images": [
    {
      "container_image_name": "string",
      "container_image_url": "string" | null,
      "container_mappings": {
        "string": "string",
        "string": [ "string" ],
        "string": [ "string", "ro" | "rw" ]
      }
    }
  ],
  "soft_time_limit": int,
  "time_limit": int,
  "disable_pre_builtins": true | false,
  "pre": [ "string" ],
  "disable_post_builtins": true | false,
  "post": [ "string" ],
  "params": [
    {
      "name": "string",
      "destination": "positional" | "localize" | "context"
    }
  ]
}
```
...
key | constraint | type | description |
---|---|---|---|
component | optional | string | component web interface to display this job type in (tosca or figaro); defaults to tosca |
label | optional | string | label to be used when this job type is displayed in web interfaces (tosca and figaro); otherwise it will show an automatically generated label based on the string after the "hysds.io.json." of the hysds-io file |
submission_type | optional | string | specifies if the job should be submitted once per product in query or once per job submission; iteration for a submission of the job for each result in the query or individual for a single submission; defaults to iteration |
enable_dedup | optional | boolean | set to true to enable job deduplication; false to disable; defaults to true |
action-type | optional | string | action type to expose job as; on-demand, trigger, or both; defaults to both |
allowed_accounts | optional | array | list of strings specifying account user IDs allowed to run this job type from the web interfaces (tosca and figaro); if not defined, ops account is the only account that can access this job type; if _all is specified in the list, then all users will be able to access this job type |
params | required | array | list of matching param objects from job-spec required to run this job (see params Object below) |
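For illustration, a minimal, hypothetical hysds-io entry using the keys above might look like the following (label and parameter names are made up; the params must match those declared in the corresponding job-spec):

```python
import json

# Hypothetical hysds-io entry; label and parameter names are illustrative.
hysds_io = {
    "label": "My PGE (example)",
    "component": "tosca",
    "submission_type": "iteration",
    "enable_dedup": True,
    "allowed_accounts": ["_all"],
    "params": [
        {"name": "input_dataset", "from": "dataset_jpath:_id"},
        {"name": "output_version", "from": "submitter", "type": "text", "default": "v1.0"},
    ],
}

print(json.dumps(hysds_io, indent=2))
```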
...
key | constraint | type | description |
---|---|---|---|
name | required | string | parameter name; should match corresponding parameter name in job-spec |
from | required | string | value for hard-coded parameter value, submitter for user submitted value, dataset_jpath:<jpath> to extract value from ElasticSearch result using a JPath-like notation (see from Specification below) |
value | required if from is set to value | string | hard-coded parameter value |
type | optional | string | possible values: text, number, datetime, date, boolean, enum, email, textarea, region, container_version, jobspec_version, hysdsio_version (see Valid Param Types below) |
default | optional | string | default value to use (must be string even if it's a number) |
optional | optional | boolean | parameter is optional and can be left blank |
placeholder | optional | string | value to use as a hint when displaying the form input |
enumerables | required if type is enum | array | list of string values to enumerate via a dropdown list in the form input |
lambda | optional | string | a lambda function to process the value during submission |
version_regex | required if type is container_version, jobspec_version or hysdsio_version | string | regex to use to filter on front component of respective container, job-spec or hysds-io ID; e.g. if type=jobspec_version, version_regex=job-time-series and the list of all job-spec IDs in the system is ["job-time-series:release-20171103", "job-hello_world:release-20171103", "job-time-series:master"], the list will be filtered down to those whose ID matches the regex job-time-series in the front component (string before the first ":"); in this example the resulting filtered set of release tags/branches will be listed as ["release-20171103", "master"] in a dropdown box; similar to type=enum and enumerables set to all release tags for a certain job type |
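To make the three from sources concrete, the following hypothetical param objects show a hard-coded value, a user-submitted enum, and a value extracted from the faceted ElasticSearch result (names and values are illustrative):

```python
# Hypothetical hysds-io param objects illustrating the three "from" sources.
params = [
    # hard-coded value: "from" is "value" and the "value" key supplies the constant
    {"name": "processing_mode", "from": "value", "value": "forward"},
    # user-submitted enum: rendered as a dropdown in the web interface
    {"name": "output_type", "from": "submitter", "type": "enum",
     "enumerables": ["netcdf", "hdf5"], "default": "netcdf"},
    # extracted from the ElasticSearch result via a JPath-like expression
    {"name": "dataset_id", "from": "dataset_jpath:_id"},
]
```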
...
value | description |
---|---|
text | a text string, will be kept as text |
number | a real number |
date | a date in ISO8601 format: YYYY-MM-DD; will be treated as a "text" field for passing into the container |
datetime | a date with time in ISO8601 format: YYYY-MM-DDTHH:mm:SS.SSS; will be treated as a "text" field |
boolean | true or false in a drop down |
enum | one of a set of options in a drop down; must specify enumerables field to specify the list of possible options; these will be "text" types in the enumerables set |
email | an e-mail address, will be treated as "text" |
textarea | same as text, but displayed larger with the textarea HTML tag |
region | auto-populated from the facet view leaflet tool |
container_version | a version of an existing container registered in the Mozart API; must define "version_regex" field |
jobspec_version | a version of an existing job-spec registered in the Mozart API; must define "version_regex" field |
hysdsio_version | a version of an existing hysds-io registered in the Mozart API; must define "version_regex" field |
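For the version-style types, the param couples a version_regex filter with the IDs registered in the Mozart API. A hypothetical example following the job-time-series illustration above:

```python
# Hypothetical jobspec_version param: the dropdown lists the release tags/branches
# of job-specs whose front component (before the first ":") matches version_regex.
version_param = {
    "name": "release_version",   # illustrative name
    "from": "submitter",
    "type": "jobspec_version",
    "version_regex": "job-time-series",
}
```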
...
- Look up the RangeBeginning and RangeEnding times to give to the L0B PGEs
- Find all L0A_Prime files that overlap with the given RangeBeginning and RangeEnding times (see the overlap sketch below)
- Associate the corresponding granule pass or tile with the given RangeBeginning and RangeEnding times for cataloging and file naming.
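The overlap test itself reduces to a simple interval comparison. The sketch below is illustrative only; the half-open interval convention and the way granule times are stored in the catalog are assumptions, not the PCM schema.

```python
from datetime import datetime

# Hypothetical temporal-overlap check for selecting L0A_Prime inputs.
def overlaps(range_beginning, range_ending, file_start, file_end):
    """Return True if [file_start, file_end) overlaps [range_beginning, range_ending)."""
    return file_start < range_ending and file_end > range_beginning

# Example: a granule that starts before the range and ends inside it overlaps.
range_begin = datetime(2021, 6, 1, 0, 0, 0)
range_end = datetime(2021, 6, 1, 0, 10, 0)
print(overlaps(range_begin, range_end,
               datetime(2021, 5, 31, 23, 55, 0),
               datetime(2021, 6, 1, 0, 2, 0)))  # True
```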
External Interface to CNES, GDS, and PO.DAAC
GDS and CNES push the incoming data files to SDS. PCM automatically monitors, detects, and ingests the defined data files at the inbound staging areas. The data files are checked for integrity before they’re ingested into PCM.
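As one possible form of that integrity check, the sketch below verifies a file against a sidecar checksum; the use of MD5 and the sidecar naming convention are assumptions for illustration, not the defined external interface.

```python
import hashlib
from pathlib import Path

# Hedged sketch: verify a staged file against a hypothetical "<file>.md5" sidecar.
def verify_checksum(data_file: str) -> bool:
    expected = Path(data_file + ".md5").read_text().split()[0].strip()
    digest = hashlib.md5()
    with open(data_file, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```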
...
- Products are placed in the archive staging location by CNES.
- PCM generates a CNM-S message which indicates the location of the product. This message is sent to the DAAC using AWS Kinesis (a hedged sketch of this hand-off follows this list).
- DAAC ingests the product and generates a CNM-R message, indicating success or failure of ingestion. This message is sent to PCM using an SNS topic which is connected to an AWS Lambda.
- PCM processes and stores CNM-R messages. These messages are cataloged and later queried for activities such as bulk reprocessing, where a large number of previously ingested products must be gathered from the DAAC.
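A hedged sketch of the CNM-S hand-off to the DAAC over Kinesis is shown below. The stream name, region, and the abbreviated payload are illustrative only; the authoritative field set comes from the CNM schema, not this example.

```python
import json
import boto3

# Hedged sketch: publish a CNM-S notification to a Kinesis stream with boto3.
kinesis = boto3.client("kinesis", region_name="us-west-2")  # region is illustrative

cnm_s = {
    "provider": "PCM",                   # illustrative; see the CNM schema for required fields
    "identifier": "example-product-id",  # illustrative
    "product": {"files": [{"uri": "s3://example-archive-staging/outgoing/example.tar"}]},
}

kinesis.put_record(
    StreamName="example-cnm-s-stream",   # hypothetical stream name
    Data=json.dumps(cnm_s).encode("utf-8"),
    PartitionKey=cnm_s["identifier"],
)
```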
For products generated by the PCM, messages are delivered to both the DAAC and CNES using the following process:
- PCM generates products
- PCM creates and copies tar’d SDP datasets to the PCM Archive Staging bucket outgoing directory.
- PCM creates CNM-S message for product
- PCM immediately creates the message and sends it to PO.DAAC
- PO.DAAC ingests the tar file
- PO.DAAC returns a CNM-R message indicating successful or failed ingestion to the PCM SDS via an AWS message service
- For success messages, PCM updates the catalog
- For failure messages, the PCM SDS troubleshoots
- Every X minutes, PCM creates a single CNM-S message covering all the tar files placed in the Archive Bucket outgoing directory and places the CNM-S message in the PCM Msg Bucket CNM-S directory (see the sketch after this list)
- CNES polls the PCM Msg Bucket CNM-S directory every Y minutes and pulls the message(s)
- CNES deletes the message
- CNES pulls the tar files identified in a message to CNES
- CNES creates one or two CNM-R messages with the success and/or failure of the tar copying activity and places the message into the PCM Msg Bucket CNM-R directory.
- CNES notifies the distribution center of the product(s)
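A hedged sketch of the batched CNM-S hand-off to CNES follows: gather the tar files currently under the archive staging outgoing prefix and write a single CNM-S message into the PCM Msg Bucket CNM-S directory. Bucket names, prefixes, and the abbreviated payload are illustrative assumptions.

```python
import json
import time
import boto3

# Hedged sketch of the periodic, batched CNM-S message for CNES.
s3 = boto3.client("s3")

ARCHIVE_BUCKET = "example-pcm-archive-staging"  # hypothetical bucket name
MSG_BUCKET = "example-pcm-msg-bucket"           # hypothetical bucket name

# Collect the tar files currently waiting in the outgoing directory.
resp = s3.list_objects_v2(Bucket=ARCHIVE_BUCKET, Prefix="outgoing/")
tar_keys = [o["Key"] for o in resp.get("Contents", []) if o["Key"].endswith(".tar")]

# Build one CNM-S message covering all of them (abbreviated payload).
cnm_s = {
    "provider": "PCM",  # illustrative; see the CNM schema for required fields
    "product": {"files": [{"uri": f"s3://{ARCHIVE_BUCKET}/{k}"} for k in tar_keys]},
}

# Drop the message where CNES polls for it.
s3.put_object(
    Bucket=MSG_BUCKET,
    Key=f"cnm-s/cnm-s-{int(time.time())}.json",
    Body=json.dumps(cnm_s).encode("utf-8"),
)
```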
Metrics & Data Accounting
PCM manages an ElasticSearch database for tracking input products, intermediate products, and all reprocessing campaigns. Metrics collected by the PCM system include:
...
- Data Accountability
- Data Product Latency
- SDS system statistics
- Product delivery to PO.DAAC & CNES
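As an example of how such metrics can be pulled from the catalog, the hedged sketch below runs a simple latency aggregation against ElasticSearch; the endpoint, index name, and the precomputed latency_seconds field are assumptions, not the PCM schema.

```python
import requests

# Hedged sketch: average product latency from a hypothetical metrics index.
query = {
    "size": 0,
    "aggs": {"avg_latency_seconds": {"avg": {"field": "latency_seconds"}}},
}

resp = requests.post("http://example-metrics-es:9200/product-delivery/_search", json=query)
print(resp.json()["aggregations"]["avg_latency_seconds"]["value"])
```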
Customized Operator Interfaces
The inherited PCM core capability provides helpful and easy-to-view web interfaces including the following:
...
Multiple concurrently operational pipelines for specific processing scenarios including Forward Processing (Persistent), Bulk Processing (Pre-defined input), and ADT Processing (Temporary). Pipelines are deployed in separate AWS accounts, which allows for isolation of permissions, costing, and AWS account limits. Certain resources, such as S3 buckets, are shared between accounts. Permissions for read/write access are specific to each account and are configurable (a hedged example of such a cross-account bucket policy follows the deployment steps below). PCM's internal deployment and development process is described in the following diagram:
This diagram shows the interactions developers have with the PCM DEV cluster and how that deployed system flows to I&T and OPS deployments. In this diagram PCM is using Terraform as the PCM Cluster Launcher. PCM's process for any new operational deployment is as follows:
- System engineer notifies SA team of plan to deploy PCM. SA team provides needed cloud credentials to deploy PCM configuration.
- Deployment engineer configures the deployment by running the PCM deployment configuration tool, which generates the PCM deployment config.
- Can optionally re-use a pre-generated configuration.
- Terraform runs, deploying AWS infrastructure and provisioning PCM components within minutes. Pre-built AMIs are used for EC2 instantiation.
- PCM pulls docker images for PGEs
- Terraform completes, providing operator with URLs to operator web interfaces to verify deployment.
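The cross-account sharing of S3 buckets mentioned above can be expressed as a bucket policy granting another account read access. The sketch below is a hedged illustration: the account ID, bucket name, and action list are assumptions, not the PCM policy.

```python
import json
import boto3

# Hedged sketch: grant a second AWS account read access to a shared bucket.
SHARED_BUCKET = "example-pcm-shared-bucket"  # hypothetical bucket name
READER_ACCOUNT_ID = "111111111111"           # hypothetical account ID

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CrossAccountRead",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{READER_ACCOUNT_ID}:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{SHARED_BUCKET}",
            f"arn:aws:s3:::{SHARED_BUCKET}/*",
        ],
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=SHARED_BUCKET, Policy=json.dumps(policy))
```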
When deployed, PCM supports using Amazon EC2 Spot Instance workers:
...