Deployment Guide
Confidence Level: TBD. This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.
This document contains the recommendations for deploying, developing, and running HySDS. It is intended as a guide for projects to start understanding the trade-offs between various deployment strategies for hybrid-cloud computing with HySDS.
Public and Private Clouds
Definitions
Infrastructure Nodes: Nodes required by HySDS to run. These nodes run 24x7 and thus establish a static baseline for compute. They consist of the Mozart, Metrics, GRQ, and Factotum nodes.
Worker Nodes: Nodes that run PGEs in HySDS. These tend to be transient and are part of a dynamically-scaled fleet of workers.
Cloud Provider: Any entity providing hardware external to the project, e.g. Azure, Amazon Web Services (AWS), the OCIO JPL Compute Cloud, Google Cloud, etc.
Hybrid Cloud: A cloud environment encompassing multiple sources of virtual machines, including cloud provider instances, local instances, and ITAR-approved provider instances.
Instance Types
When using remote cloud providers, different types of instances may be available and serve different purposes when running HySDS. For simplicity, the Amazon Web Services (AWS) nomenclature is used below. "On premises" refers to hardware provided locally to the project, typically owned (and maintained) by the project.
On-Premises
Hardware owned and maintained by the project. This generally does not include OCIO-provided hardware (as that would be on-demand).
On-Demand (Cloud Provider)
On-demand instances are expensive, but are truly "as-needed": if you pay for an on-demand instance, you get that instance (subject to AWS's SLAs). These instances tend to be used for HySDS infrastructure (when it runs on cloud provider hardware) because they do not risk termination by market forces. Note that running these instances 24x7 can be expensive, so a cost analysis of this approach versus on-premises hardware should be done. See "Cloud Provider Infrastructure, Cloud Provider Workers" below.
Spot Market (Cloud Provider)
Spot Market instances allow the user to bid on unused AWS capacity. Costs are therefore much lower than on-demand prices; it is not unusual to see spot instances cost 10% of the on-demand rate. However, if the market price of the needed instance type exceeds the user's bid, AWS will terminate the user's running instance. It therefore does not make sense to run infrastructure on the spot market, because the cost of an infrastructure failure is too high to justify the savings. Compute workers, in contrast, are ideally suited to the spot market: the cost of a worker failure is low, and automatic cleanup of workers is straightforward.
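As an illustration only, the sketch below shows one way a spot worker could be requested with boto3. The AMI ID, instance type, key name, and bid price are placeholder assumptions; a real HySDS deployment typically manages its spot worker fleet through its own auto-scaling tooling rather than direct API calls like this.

```python
# Hypothetical sketch: requesting a single spot worker instance with boto3.
# All identifiers below are placeholders, not values from a real cluster.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.request_spot_instances(
    SpotPrice="0.10",                 # maximum price willing to pay (USD/hour)
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-00000000",    # placeholder worker AMI
        "InstanceType": "c5.xlarge",  # placeholder instance type
        "KeyName": "my-key",          # placeholder key pair
    },
)

for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```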
Reserved Instances (Cloud Provider)
Reserved instances are very much like on-demand instances except that the price is lower. However, the user is required to purchase the instance in one-year increments. These instances therefore make financial sense only if the user intends to run the instance for more than one year.
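A rough way to reason about the reserved-versus-on-demand decision is to compare a year of each and see how much of the year the node must actually run before the commitment pays off. The hourly rates below are placeholders, not actual AWS prices.

```python
# Illustrative break-even check between on-demand and reserved pricing.
# Replace the placeholder rates with your provider's actual quotes.
on_demand_hourly = 0.17             # placeholder USD/hour, pay only while running
reserved_effective_hourly = 0.11    # placeholder USD/hour for a 1-year commitment

hours_per_year = 24 * 365
on_demand_year = on_demand_hourly * hours_per_year
reserved_year = reserved_effective_hourly * hours_per_year

print(f"on-demand, full year:  ${on_demand_year:,.0f}")
print(f"reserved, full year:   ${reserved_year:,.0f}")

# Reserved is paid for the whole year regardless of use, so it only wins if
# the node would otherwise run (on-demand) more than this fraction of the year.
break_even_utilization = reserved_year / on_demand_year
print(f"break-even utilization: {break_even_utilization:.0%} of the year")
```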
Cloud Regions
AWS and other cloud providers such as Azure are divided into multiple cloud regions around the globe, which enables cloud deployments to run globally. Each region is typically located in a separate geographic area and has multiple Availability Zones (AZs), which are separated enough to be independent, so that the region can potentially survive the "blast radius" of a localized failure. For AWS, each AZ is further divided into multiple data centers.
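For example, the AZs available to an account in a given region can be listed with boto3 (the region name here is just an example):

```python
# Sketch: listing the Availability Zones of one region to illustrate the
# region/AZ structure described above.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

for az in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(az["ZoneName"], az["State"])
```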
Public Regions
Most regions are public regions that allow anyone to run any code. AWS currently has two main large US regions, us-east-1 and us-west-2, and a smaller region, us-west-1, in the Bay Area.
Government Regions for ITAR/Export Control
For ITAR and export-controlled software and/or data, one must run in a government region approved for ITAR/export-controlled use. AWS provides GovCloud (us-gov-west-1) in the Oregon/Seattle area.
FEATURE PARITY
Note that these GovCloud regions are typically much smaller in capacity and have higher costs compared to public regions. There are also feature-parity issues: some services, such as the spot market, are currently not available in GovCloud. Together, these make GovCloud more expensive to run in than public regions.
Deployment Configuration Best Practices
These best practices pertain to the deployment configuration of HySDS.
Making a Decision
The key decisions to make in order to pick a deployment configuration below are the following:
Will the project need a scalable worker fleet? If you are using HySDS, the answer is most likely yes. If processing requirements are guaranteed to be fixed for the lifetime of the project, a fixed worker fleet would suffice; even then, a scalable fleet is still useful for resiliency, since downtime in any worker can be compensated automatically by the auto-scaled fleet.
Does the cost of on-premises infrastructure (including maintenance costs, SA costs, etc.) undercut the cost of purchasing on-demand or reserved instances over the expected lifetime of the project? Remember, the cost of on-premises computing must include all costs related to maintaining project hardware, i.e. the total cost of ownership (TCO), including facilities and infrastructure labor.
If the answer to #1 is yes, then cloud provider workers are required. If the answer to #2 is yes (on-premises is cheaper), then on-premises infrastructure is recommended. If the answer to #2 is no, then it is recommended that the project procure infrastructure nodes via a cloud provider.
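A minimal sketch of the decision #2 comparison, assuming placeholder dollar figures; substitute the project's actual hardware quotes, TCO estimates, and cloud rates:

```python
# Illustrative cost comparison for decision #2. All figures are placeholders.

def on_premises_cost(hardware, annual_tco, years):
    """Up-front hardware plus facilities/labor/maintenance TCO per year."""
    return hardware + annual_tco * years

def cloud_infrastructure_cost(hourly_rate, node_count, years):
    """Infrastructure nodes running 24x7 on on-demand or reserved instances."""
    return hourly_rate * node_count * 24 * 365 * years

years = 3
on_prem = on_premises_cost(hardware=120_000, annual_tco=40_000, years=years)
cloud = cloud_infrastructure_cost(hourly_rate=0.17, node_count=4, years=years)

print(f"on-premises over {years} years: ${on_prem:,.0f}")
print(f"cloud over {years} years:       ${cloud:,.0f}")
print("on-premises is cheaper" if on_prem < cloud else "cloud is cheaper")
```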
A Note on Development Clusters
Projects generally need to provide clusters for development, PGE integration, and continuous integration purposes. Since these clusters tend to be smaller, they are traditionally run on-premises; however, the same cost analysis (see "Making a Decision" above) should be done for these clusters, as it may be cheaper (or less onerous) to run them on reserved or on-demand instances. Although these clusters typically have a fixed number of workers (usually one), they may need a scalable worker fleet and thus may be deployed using any of the configurations below.
On Premises Infrastructure, On Premises Workers
This is the most basic deployment, where HySDS runs entirely on on-premises compute infrastructure. The project deploying HySDS has to procure hardware to run all aspects of the system, making this the configuration most similar to the old way of running data systems. The most obvious implication is that HySDS will not be able to scale dynamically, since the system is not running on any provider capable of supplying "as-needed" compute.
Pros:
All costs are up-front
Cons:
No dynamic scaling capability
Ideal for:
Fixed compute/storage needs.
Services that need to run 24/7 and are not elastic.
Lower durability requirements.
On Premises Infrastructure, Cloud Provider Workers
This deployment configuration allows for infrastructure nodes to run on-premises while workers run on cloud-provider hardware. This allows the worker fleet to be dynamically scaled while keeping the constant baseline hardware for infrastructure on-premises. The implication of this is that infrastructure costs can be handled up-front (just like previous generation data systems) and only the compute for work is scaled. Workers likely run as spot-instances.
Pros:
Known infrastructure costs are up-front
Workers are allowed to dynamically scale to meet demand
Cons:
Infrastructure-to-worker communication must be explicitly handled, i.e. workers must be on JPL's network, worker-to-infrastructure ports must be open, etc. (see the connectivity sketch after this list)
Worker cost is not up-front
Ideal for:
Balanced fixed cost infrastructure with elastic compute/storage needs
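As a minimal sketch of the communication concern noted above, a cloud worker could verify that it can reach the on-premises infrastructure services before joining the fleet. The hostnames are placeholders, and the ports shown are the defaults for RabbitMQ, Redis, and Elasticsearch; confirm which services and ports your cluster actually uses.

```python
# Sketch: quick connectivity check from a cloud worker to on-premises
# infrastructure nodes. Hostnames and ports are placeholder assumptions.
import socket

ENDPOINTS = {
    "mozart-rabbitmq": ("mozart.example.com", 5672),
    "mozart-redis": ("mozart.example.com", 6379),
    "grq-elasticsearch": ("grq.example.com", 9200),
}

for name, (host, port) in ENDPOINTS.items():
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{name}: reachable on {host}:{port}")
    except OSError as err:
        print(f"{name}: NOT reachable on {host}:{port} ({err})")
```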
On Premises Infrastructure, On Premises Worker Baseline, Cloud Provider Additional Workers (Burst Processing)
This deployment is for projects that have a steady-state processing requirement that fits well on on-premises hardware but may need to burst to the cloud for non-predicted processing. OCO2's use of HySDS is similar to this approach: HySDS was used to add as-needed resources for processing above the day-to-day load that the on-premises cluster was sized for. Additional workers run as spot instances.
Pros:
Known infrastructure and day-to-day worker costs are up-front
Workers may be dynamically scaled
Cons:
Infrastructure-to-worker communication must be explicitly handled, i.e. workers must be on JPL's network, worker-to-infrastructure ports must be open, etc.
Worker fleet complexity is slightly higher in order to differentiate between burst and non-burst processing
Ideal for:
Balanced fixed cost infrastructure and fixed baseline load, with elastic compute/storage needs
Cloud Provider Infrastructure, Cloud Provider Workers
This deployment model is for projects that wish to run entirely on cloud provider hardware (and not maintain operations hardware in-house). Typically this is done because of the difficulty of running and maintaining hardware on-premises. It may be that the OCIO has also encouraged a project to follow this approach. Infrastructure nodes should be run as on-demand or reserved instances, and workers are typically run as spot-instances.
Pros:
Hardware is not maintained by project
Infrastructure and workers are both on cloud-provider network
Cons:
Costs are not up-front and may be higher
Ideal for:
Best performance and scalability.
PGE Development Best Practices
PGEs as executables and/or Containers in compute instances
PGEs have typically been delivered as executables with configuration and input files. The premise of PGE invocation is that input data is spoon-fed into a unique work directory for each PGE run, and the output data products are generated in that same work directory.
A general trend is to encapsulate the PGE and all of its dependencies into containers (e.g. Docker). This helps manage different PGEs, and different versions of a PGE, as containers. These PGE containers are then run inside a compute instance.
Another general strategy has been a one-to-one mapping of PGE to compute instance. This enables better sizing of the instance type and makes the metrics collected from that compute instance more representative of the PGE. Running multiple PGEs on one compute instance would couple many of these aspects together.
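The sketch below illustrates the container-per-PGE pattern using the Docker SDK for Python. The image name, entry point, and paths are placeholder assumptions; in an operational HySDS cluster, the Verdi worker invokes the PGE container for you.

```python
# Sketch of the container-per-PGE pattern: one PGE image run against one
# unique work directory. All names and paths below are placeholders.
import docker

client = docker.from_env()

work_dir = "/data/work/jobs/job-placeholder"       # unique work dir for this run

logs = client.containers.run(
    image="my-project/my-pge:v1.0.0",               # placeholder PGE container image
    command="run_pge.sh",                           # placeholder entry point in the image
    volumes={work_dir: {"bind": "/data/work", "mode": "rw"}},
    working_dir="/data/work",                       # PGE reads/writes only its work dir
    remove=True,                                    # clean up the container on exit
)
print(logs.decode())
```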
Abstraction from cloud
PGE developers need to be abstracted from cloud computing constructs. PGEs are intended to run inside a compute node (worker instance) and handle data products in local work directories. The Verdi PGE wrapper automatically downloads/localizes input data products to local directories and automatically publishes/ingests/uploads products found in the working directory. The PGE itself therefore only needs to run from a local directory and publish to that local directory.
PGE execution starts in a minimalist environment (similar to cron), so the PGE developer is responsible for defining all needed environment variables. This makes it easier to ensure that all code, resources, and dependencies are shipped together as a unit, which simplifies the deployment and the environment that needs to be set up. It is not impossible to ship code, resources, and dependencies as separate units, but it is harder to construct the environment properly. Typically, PGE wrapper shell scripts are used to set up the environment variables.
Another benefit of abstraction from the cloud is that PGEs can focus on the algorithm and data processing. It decouples the PGE code from any one source/sink storage type such as S3, S3-IA, NFS, WebDAV, etc. Moving data push/pull out of the PGE and into the data system's PGE wrapper lets PGEs focus on processing and lets the wrapper perform data movement.
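A minimal, hypothetical PGE skeleton following this local-work-directory convention might look like the following. The file names and metadata fields are illustrative assumptions, not the exact HySDS dataset specification.

```python
# Minimal sketch of a PGE that only touches its local work directory: inputs
# are already localized there by the Verdi wrapper, and anything written as a
# dataset directory is picked up for publishing. Names are placeholders.
import json
import os

work_dir = os.getcwd()                          # the wrapper starts the PGE in its work dir

# Read a localized input file (placeholder name).
with open(os.path.join(work_dir, "input.json")) as f:
    params = json.load(f)

# ... run the science algorithm against local files only ...

# Write the output product into a dataset directory in the same work dir.
dataset_id = "L2_PRODUCT-20240101T000000-v1.0"  # placeholder dataset ID
dataset_dir = os.path.join(work_dir, dataset_id)
os.makedirs(dataset_dir, exist_ok=True)

with open(os.path.join(dataset_dir, f"{dataset_id}.met.json"), "w") as f:
    json.dump({"input": params.get("id")}, f)   # placeholder metadata
```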
Credentials Handling
Security keys and other credentials should not be kept in source code CM, as this increases the potential to accidentally commit those files to git.
All credentials should be kept in the developer's/ops' home directory, e.g. in .aws, .netrc, .boto, etc.
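For example, with boto3 the default credential chain (environment variables, ~/.aws/credentials, or an instance profile) can be used so that no keys ever appear in source code; the bucket name below is a placeholder.

```python
# Sketch: rely on the default credential chain instead of hard-coding keys.
# boto3 resolves credentials from the environment, ~/.aws/credentials, or an
# instance profile, so no secrets need to live in source code CM.
import boto3

session = boto3.Session()                       # no access keys passed in code
s3 = session.client("s3")

for obj in s3.list_objects_v2(Bucket="my-project-bucket").get("Contents", []):
    print(obj["Key"])                           # placeholder bucket name above
```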
Cost Accounting Best Practices
It is important for the SDS manager and Ops lead to be cognizant of cloud computing costs on a daily and weekly basis. They need to monitor the bill closely, so as not to be surprised by a large bill at the end of the month. They also need to understand what they are being billed for and best practices for minimizing the bill.
It is also important to understand that as long as you are using any cloud computing service, you are being billed for it. So even though you may not be actively using the cloud, you may still have files stored in the cloud, machines that are suspended but not terminated, etc.
For example, cloud computing costs for Amazon Web Services include:
Storage costs (e.g. S3, S3-IA, Glacier)
Network egress costs
Inter-region transfer costs
Compute costs (e.g. EC2, ECS)
EC2-related costs
EBS volumes: Even if an EC2 instance is in the stopped state, you will still be billed for the EBS storage attached to that instance
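One way to keep an eye on daily spend is the Cost Explorer API. The sketch below (assuming Cost Explorer is enabled on the account and the caller has the relevant permissions) pulls the last week of daily costs.

```python
# Sketch: pull the last week of daily costs from AWS Cost Explorer so the SDS
# manager / Ops lead can watch spend without waiting for the monthly bill.
import datetime
import boto3

ce = boto3.client("ce", region_name="us-east-1")    # Cost Explorer endpoint

end = datetime.date.today()
start = end - datetime.timedelta(days=7)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)

for day in resp["ResultsByTime"]:
    amount = float(day["Total"]["UnblendedCost"]["Amount"])
    print(day["TimePeriod"]["Start"], f"${amount:,.2f}")
```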
Project-Specific Compute Charging
With the delivery of HySDS v2, all compute is queue-agnostic. This means the operator selects the queue that a given compute request will run on (whether on demand, via a trigger rule, or via a workflow service like CWS). However, any queue can run any piece of compute (due to containerization). This means we can use different queues to establish different charging by setting up a simple procedure.
We recommend setting up the available queues as follows so that project-specific charging is easy to accomplish:
First, the workers billed to a project should be configured such that every queue processed by those workers is prefixed with the project name, e.g. "grfn-"
Any processing for a given project should be directed to queues prefixed with that project name (e.g. "grfn-") at submission time
Infrastructure queues are run on infrastructure nodes, which are owned by and charged to a single project. See the note below.
(Optional) If a project desires multiple types of queues, that can be accomplished with the portion of the queue name after the prefix, e.g. "grfn-small-jobs"
Note: Only compute is charged to specific projects. Infrastructure is charged to one project (or run on JPL-local servers). Projects needing to be completely separate must run separate data systems.
Project-specific workers thus run from project-specific queues and are charged to project-specific accounts, cleanly separating the charging for compute.
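As an illustration of the queue-routing idea (not the exact HySDS submission API), the Celery snippet below shows how a project-prefixed queue name directs work to the workers billed to that project; the broker URL and task name are placeholders.

```python
# Sketch: routing a job to a project-prefixed queue. In practice HySDS jobs
# are submitted through its own tooling (Mozart API, trigger rules, etc.);
# this only illustrates how the queue name selects the workers (and hence
# the project account) that run the job. Broker URL and task name are
# placeholder assumptions.
from celery import Celery

app = Celery("submit", broker="amqp://mozart.example.com")

app.send_task(
    "hysds.run_job",                                 # placeholder task name
    kwargs={"payload": {"job_type": "job-example:v1.0"}},
    queue="grfn-small-jobs",                         # project prefix determines charging
)
```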
Have Questions? Ask a HySDS Developer:
Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.
JPLers can also ask HySDS questions at Stack Overflow Enterprise.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua