Welcome to the HySDS Wiki

HySDS: Hybrid-Cloud Science Data System

What is HySDS?

The Hybrid Cloud Science Data System, or HySDS (pronounced "Hi-S-D-S"), is an open source science data system for processing large volumes of science data. It has been used across several NASA missions to produce L0 through L2/L3 science data products for projects such as NISAR, SWOT, OPERA, MAAP, SMAP with HySDS (SWH), and OCO-2 L2FP.

What are some features of HySDS?

As a hybrid-cloud system, HySDS can use both cloud and on-premise computing resources. This gives projects maximum flexibility: they can exploit the elasticity of the cloud to scale up or down as needed in a cost-efficient way, while using on-premise compute for lower-cost, more fixed workloads.

HySDS features fall into four areas: Processing, Operations, Community, and UI/UX.

Large-scale processing

HySDS was developed to support large-scale processing needs, such as sustaining 100 TB of data per day and running on thousands of parallel compute nodes. HySDS has undergone what the team calls "death valley" tests, in which it is run under extreme conditions to better understand its behavior under load. In prior tests it was load-tested to over 8,200 concurrent nodes (more than 250,000 cores) and more than 3 million processing jobs per day.

Processing in Volatile Computing Environments

A key feature is intrinsic support for workers running in volatile, ephemeral environments such as the AWS spot market, where excess compute capacity is offered at a discount but can be terminated with only two minutes' warning. Failed jobs are automatically detected and retried, and jobs on terminated nodes are automatically retried on other nodes. Workers can also detect when a compute node becomes impaired and is causing rapid draining of jobs (job-drain detection).
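As a rough illustration of how a worker can detect an impending spot termination (a minimal sketch under assumptions, not HySDS's actual implementation), AWS publishes an interruption notice on the EC2 instance-metadata endpoint about two minutes before reclaiming the instance:

```python
import time
import requests

# EC2 instance-metadata endpoint that returns a spot interruption notice
# (HTTP 200 with JSON) roughly two minutes before the instance is reclaimed;
# it returns 404 while no interruption is scheduled. Newer instances may
# require an IMDSv2 session token, omitted here for brevity.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_spot_interruption(on_interrupt, poll_seconds=5):
    """Poll the metadata service and invoke a handler when termination is imminent."""
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Hypothetical handler: e.g. stop pulling new jobs and
                # requeue the current one so another worker can retry it.
                on_interrupt(resp.json())
                return
        except requests.RequestException:
            pass  # metadata-service hiccups are ignored; just poll again
        time.sleep(poll_seconds)
```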

Engineered for Large-Scale Robustness

API calls internal to HySDS and to AWS are designed for large-scale use: the "thundering herd" effect is mitigated with exponential backoff and jitter, approaches that allow the software to scale up to thousands of parallel compute nodes. Architecturally, components were designed for large distributed compute fleets, and performance tuning was done to enable further efficiencies.
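For reference, a minimal sketch of exponential backoff with "full jitter"; the retry counts and delays below are arbitrary, not values HySDS uses:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=8, base=0.5, cap=60.0):
    """Retry a flaky call with exponential backoff plus full jitter,
    so that many workers retrying at once do not stampede the service."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential window capped at `cap`, then pick a random point
            # in it ("full jitter") to de-correlate retries across workers.
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
```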

On-Demand Processing

Separate from automated forward processing and bulk processing, HySDS supports invoking PGE algorithm steps on demand via a GUI (and API). Bulk invocations are also natively supported: by faceting on a collection of data, each granule can be submitted as its own job iteration for on-demand bulk processing.
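A hypothetical sketch of what an on-demand submission through the API might look like; the host, endpoint path, payload fields, and queue name below are illustrative assumptions, not HySDS's documented API contract:

```python
import requests

# Hypothetical job-submission call: the endpoint path, job type name, and
# parameter fields below are illustrative, not HySDS's actual API contract.
MOZART_API = "https://mozart.example.com/api/v0.1"

def submit_on_demand_job(job_type, params, queue="factotum-job_worker", priority=5):
    payload = {
        "type": job_type,          # e.g. a PGE wrapped as a job type
        "queue": queue,            # worker queue to run on
        "priority": priority,      # illustrative 0-9 scale, higher runs first
        "params": params,          # algorithm-specific inputs (e.g. a granule ID)
    }
    resp = requests.post(f"{MOZART_API}/job/submit", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```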

Data Caching Per Worker

Each HySDS worker tracks data usage and caches remote data URLs, so any recurrent data access becomes a cache hit and avoids duplicate downloads.
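A minimal sketch of the idea, assuming a URL-keyed cache directory; the layout and hashing scheme are illustrative, not HySDS's actual cache implementation:

```python
import hashlib
import os
import shutil
import urllib.request

CACHE_DIR = "/data/cache"  # assumed local cache root

def fetch_with_cache(url):
    """Return a local path for `url`, downloading only on a cache miss."""
    key = hashlib.sha256(url.encode()).hexdigest()
    cached = os.path.join(CACHE_DIR, key)
    if os.path.exists(cached):          # cache hit: reuse the prior download
        return cached
    os.makedirs(CACHE_DIR, exist_ok=True)
    tmp = cached + ".part"
    with urllib.request.urlopen(url) as src, open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)    # stream to a temp file first
    os.rename(tmp, cached)              # atomically publish into the cache
    return cached
```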

Automated Checksum Validation for Data

If a collocated checksum file (e.g., .md5) exists for a data file that a worker stages in, the worker automatically uses it to verify the data file before continuing.
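A minimal sketch of the kind of check involved, assuming a sidecar file in the common `<hex digest>  <filename>` format:

```python
import hashlib

def verify_md5(data_path, checksum_path):
    """Compare a staged file against its collocated .md5 sidecar file."""
    with open(checksum_path) as f:
        # Typical sidecar format: "<hex digest>  <filename>"
        expected = f.read().split()[0].lower()
    md5 = hashlib.md5()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            md5.update(chunk)
    if md5.hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {data_path}")
```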

Multi-cloud

HySDS has mainly been used in hybrid AWS and on-premises deployments. Prototypes have also been deployed on other cloud platforms such as Google Cloud Platform (GCP) and Azure.

Hybrid On-premise and HECC

The processing workers can be deployed on-premises on bare metal, on VMs, and on NASA's High-End Computing Capability (HECC). Several projects have ported HySDS to run workers on HECC via PBS. Similarly, HySDS control nodes can also be deployed fully on-premises for smaller, cost-constrained deployments.

Production Trigger Rules without Coding

Powerful production trigger rules and their actions can be defined simply by navigating through faceted search; no coding is required.

Jobs have Priorities

Unlike many other SDSes and job management tools, HySDS supports job priorities. This allows operations to more easily support higher-priority processing (e.g., urgent response) without needing to create additional queues.
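The effect of priorities can be illustrated with a toy in-memory queue (purely illustrative; HySDS's actual queuing is handled by its message broker):

```python
import heapq
import itertools

class PriorityJobQueue:
    """Toy in-memory queue: higher-priority jobs are dequeued first,
    with FIFO order preserved among jobs of equal priority."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for stable ordering

    def submit(self, job, priority=0):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job))

    def next_job(self):
        _, _, job = heapq.heappop(self._heap)
        return job

queue = PriorityJobQueue()
queue.submit("routine_reprocessing", priority=1)
queue.submit("urgent_response_interferogram", priority=9)
assert queue.next_job() == "urgent_response_interferogram"
```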

Job De-duplication

Jobs of the same type and with the same input parameters are detected as duplicates so that the system does not run the same work repeatedly.
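A minimal sketch of how such a duplicate check can work, assuming a fingerprint over the job type and canonicalized parameters (the hashing scheme is illustrative, not necessarily what HySDS uses):

```python
import hashlib
import json

def job_dedup_key(job_type, params):
    """Derive a stable fingerprint from the job type plus canonicalized
    input parameters; two submissions with the same key are duplicates."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(f"{job_type}:{canonical}".encode()).hexdigest()

# Same inputs in a different key order still collapse to one key.
a = job_dedup_key("ingest_granule", {"granule": "G-0001", "version": "1.0"})
b = job_dedup_key("ingest_granule", {"version": "1.0", "granule": "G-0001"})
assert a == b
```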

User Tags to Jobs and Data

Users can tag specific jobs and datasets in the faceted search to better organize them and to manage and debug problematic jobs and data products more efficiently. The tags themselves become facets that can be shared with and discovered by colleagues.

Errors/Tracebacks are streamed back to Operators

To make it easier to assess why jobs failed, errors and tracebacks are captured and sent back to a faceted search interface for operators; there is no need to hunt down log files to get a quick assessment.
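A minimal sketch of the general pattern, assuming a hypothetical ingest endpoint and document fields (not HySDS's actual error schema):

```python
import socket
import traceback
from datetime import datetime, timezone

import requests

# Hypothetical ingest endpoint and document fields; the real HySDS error
# records and index names may differ.
ERROR_INDEX_URL = "https://metrics.example.com/job_failures/_doc"

def run_job(step_fn, job_id):
    try:
        return step_fn()
    except Exception as err:
        doc = {
            "job_id": job_id,
            "worker": socket.gethostname(),
            "error": str(err),
            "traceback": traceback.format_exc(),   # full stack trace for operators
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        requests.post(ERROR_INDEX_URL, json=doc, timeout=10)
        raise  # still fail the job so it can be retried
```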

Real-time View of Each Job’s Work Dirs

Each job running on the distributed workers can have its live work directory exposed as a WebDAV link, so users can inspect the work directory contents from a browser while the job is running. Because it is WebDAV, the work directory can also be mounted natively by major operating systems as a remote drive.

Metrics for Runtime Analytics

Key runtime performance metrics are captured for every science data processing job, providing situational awareness of how the system is performing. These metrics enable analysis of data production, worker behavior, compute/storage performance, and overall efficiency.

Versioned deployment of system and algorithms

All algorithms are encoded as Docker containers, versioned, and deployed to production in a way that allows specific releases of the software and data to be reproduced.

Cost-Production Models

After years of use across several data production projects, cost-production models are available to estimate production times, data volumes, compute resource sizes, storage, and cost.
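As a back-of-the-envelope illustration of what such a model estimates (the formulas and numbers below are invented for the example, not HySDS's actual cost model):

```python
def estimate_campaign(granules, minutes_per_granule, cores_per_job,
                      cores_per_node, nodes, node_hourly_rate):
    """Back-of-the-envelope production estimate; every input is illustrative."""
    jobs_in_parallel = nodes * (cores_per_node // cores_per_job)
    waves = -(-granules // jobs_in_parallel)            # ceiling division
    wall_clock_hours = waves * minutes_per_granule / 60
    node_hours = nodes * wall_clock_hours
    return {
        "wall_clock_hours": round(wall_clock_hours, 1),
        "node_hours": round(node_hours, 1),
        "compute_cost_usd": round(node_hours * node_hourly_rate, 2),
    }

# e.g. 50,000 granules at 10 min each, 4 cores/job on 16-core nodes,
# 200 spot nodes at a notional $0.30/hour
print(estimate_campaign(50_000, 10, 4, 16, 200, 0.30))
```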

Tools for Operations

Projects have contributed various tools for day-to-day operations, for example Elasticsearch backup tools and data accountability/audit tools. A few projects have also added Slack bots to check service uptime. Watchdogs are provided to check for anomalies such as timed-out jobs and orphaned data that can occur when running in volatile spot-market environments.

Automated Smoke Tests from Repositories

NASA missions such as NISAR run weekly automated smoke tests, and SWOT runs nightly smoke tests. These check out fresh code, deploy a full system, and run mini end-to-end science data processing to "smoke out" recent code updates.

Open Source

HySDS development has been done in the open and coordinated across several projects, with versioned releases.

Documentation

Collaborative online documentation is maintained in this wiki as well as in the GitHub repositories.

Community Coordination

Features and bug fixes are coordinated in a public Jira. HySDS is continuously improved and matured with feedback from the operations of NASA missions.

Used by NASA missions, Earth Science data production, and R&A projects.

HySDS is used by NASA missions such as NISAR, SWOT, SMAP with HySDS (SWH), OCO-2 L2FP, OPERA, and the Multi-Mission Algorithm and Analysis Platform (MAAP). It has also been used for data production projects such as MEaSUREs, the ASTER Volcano Archive (AVA), and ARIA, as well as other research and analysis projects.

Standards and Interoperability

Ports of HySDS exist that expose processing via OGC WPS and WPS-T. Ports also exist that expose a HySDS cluster as an OGC ADES service. HySDS also generates simple W3C PROV provenance traces of each processing step.
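For a sense of what a provenance trace captures, here is a minimal record in the spirit of W3C PROV-JSON, expressed as a Python dict; the identifiers and labels are invented for illustration:

```python
# A minimal provenance record in the spirit of W3C PROV (PROV-JSON style);
# all identifiers and attributes here are invented for illustration.
prov_doc = {
    "prefix": {"hysds": "https://example.com/hysds/ns#"},
    "entity": {
        "hysds:input_granule": {"prov:label": "input SLC granule"},
        "hysds:output_product": {"prov:label": "L2 interferogram"},
    },
    "activity": {
        "hysds:job-1234": {"prov:label": "PGE processing step"},
    },
    "used": {
        "_:u1": {"prov:activity": "hysds:job-1234",
                 "prov:entity": "hysds:input_granule"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "hysds:output_product",
                 "prov:activity": "hysds:job-1234"},
    },
}
```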

Faceted Search of Jobs/Tasks/Workers/Events

Everything in HySDS is exposed through faceted search, which provides a powerful way to interact with multi-dimensional data. Operators are provided with a faceted search UI for resource management that shows information about jobs, tasks, workers, and events. Operators can also tag specific jobs for operational book-keeping and mitigation.

Faceted Search of Data

All canonical data flowing through the system is accounted for, cataloged, and exposed through a faceted search tool. Users can also tag specific datasets for book-keeping, collaboration, and mitigation.

Faceted Search of Runtime Metrics

All distributed jobs emit real-time telemetry that is captured in Elasticsearch and visualized in a Kibana metrics dashboard. This enables real-time situational awareness of system performance, problem areas, and throughput.
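As an illustration of the kind of analysis this enables, here is a hypothetical Elasticsearch aggregation over per-job metrics; the index and field names ("job_type", "duration_seconds") are assumptions:

```python
# Illustrative Elasticsearch aggregation over per-job metrics; the index
# name and field names are assumptions, not HySDS's actual metrics schema.
runtime_by_job_type = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "by_job_type": {
            "terms": {"field": "job_type"},
            "aggs": {
                "avg_runtime": {"avg": {"field": "duration_seconds"}},
                "p95_runtime": {
                    "percentiles": {"field": "duration_seconds", "percents": [95]}
                },
            },
        }
    },
}
# e.g. POST /job_metrics/_search with this body to chart average and
# tail runtimes per job type over the last day.
```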

Dynamically-generated UI Interfaces for On-demand Processing of Algorithms

To support on-demand job submissions, dynamically generated GUI input forms are provided for each job type for ease of use.

Interface with Jupyter Notebook Front-ends

Projects have connected Jupyter notebook front-ends to invoke on-demand processing via API calls to HySDS as the back-end.

What about open source science?

HySDS was released as open source software in 2017 under the Apache License. Since then, HySDS has seen more community usage and contributions across several NASA missions. The primary source code can be found under https://github.com/hysds/ , with release updates available here: https://github.com/hysds/hysds-framework/releases . There is a community wiki at https://hysds-core.atlassian.net/. There are community Jira issues and feature requests posted at https://hysds-core.atlassian.net/jira/software/c/projects/HC/issues .

History

HySDS development started in 2007-2009 at the NASA Jet Propulsion Laboratory, with implementations from various ROSES-funded projects for climate science. The focus on moving to the cloud really took shape in 2011, when the system was funded to move into the AWS cloud to support large-scale InSAR processing of COSMO-SkyMed (CSK) and Sentinel-1A/B data. It was later adapted to support science data processing for A-Train climate data collocation and data fusion, and in 2015 it was ported to support OCO-2's L2 full-physics processing in AWS. Other funded activities included the ARIA project for hazards monitoring in AWS, along with machine learning for data quality screening.

Around 2016, HySDS was further matured and improved to support Getting Ready for NISAR (GRFN), where it was tested at larger scales (8,200+ concurrent nodes, 250K+ cores per deployment), larger than a single Kubernetes cluster supports. A MEaSUREs project on water vapor and cloud climatology data fusion of A-Train observations used HySDS for production with the entire system, including the control nodes, running on the AWS spot market. The MAAP project has been using HySDS to support on-demand processing and analysis, and NISAR and SWOT have continued to iterate on and mature HySDS to support these larger-scale missions. In 2023, HySDS reached its v5 release with more robustness and maturation, and SWOT, OPERA, and SMAP also started running operations with HySDS that year.

A Real World Example

On July 4th and 5th, 2019, Ridgecrest, California was hit with two large earthquakes, a magnitude 6.4 and a magnitude 7.1, the largest earthquakes to hit the region in approximately 40 years.


On July 8th, 2019, NASA's Advanced Rapid Imaging and Analysis (ARIA) team captured satellite radar data of the area and compared it against data of the same area captured just three months before. Using HySDS, the team was able to produce a map of the surface changes.

The map shows changes in surface soil, with the colors representing a level change of 4.8 inches higher or lower than the previous reading. Using this map, scientists in the field are able to assess damage in the area and track the fault lines that gave way. Projects like ARIA allow NASA to promote disaster prediction, preparation, response, and recovery. 

HySDS is the system that produced this map. While there are other science data systems that can do similar jobs, HySDS provides a convenient, economical, and open source solution.

This collaborative site is used to document and manage our HySDS projects, including the deployment, operations, and development of JPL's ARIA, NISAR, and OCO-2 missions.

 

Have Questions? Ask a HySDS Developer

Join our public Slack channel to learn more about HySDS.

