Azure
Created by Hook Hua on Apr 11, 2016
Confidence Level: TBD. This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.
HySDS on Azure Classic - Testing Impressions and Results 2015-11 - 2016-02
Author: Michael Starch
Test Project: Hybrid Cloud Science Data System
Intro
This document contains general thoughts and impressions on using the Azure service to run HySDS processing. The purpose is to report not just the “does it work” results of the testing, but also usage notes and difficulties encountered, so that upcoming projects can make informed decisions about using Azure for their cloud needs.
Getting Started
Setting up basic infrastructure on Azure turned out to be more complicated than expected, but may be considerably simpler if the project can run on the provided images (SUSE Linux, Windows). For HySDS, we were required to use an ITAR-compliant CentOS image, so we had to import our own image built from the base cloud image available on centos.org. The process to import and start up an image is as follows:
Download Cloud CentOS image from centos.org
Import image into OpenStack, AWS, Hyper-V or another virtual service
Set up basic image
Make Azure-specific configuration changes
Install Azure-specific client
Capture image of system
Convert Azure-configured image to VHD format
Upload VHD as Page Blob
Modify Azure Python code to force Page Blob format
-- OR --
(Untested) Use command line tools to upload as Page Blob
Start up instance using password access only (see Security Notes)
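The conversion and upload steps above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in this test: the source does not name a conversion tool, so `qemu-img` is assumed here, and the file names and helper functions are hypothetical. The sketch relies on two constraints: Azure expects a fixed-size VHD whose virtual size is a whole number of MiB, and a fixed VHD ends with a 512-byte footer, so the uploaded file is a multiple of 512 bytes as Page Blobs require.

```python
import shlex
from typing import List

MIB = 1024 * 1024
VHD_FOOTER_BYTES = 512  # a fixed-size VHD ends with a 512-byte footer


def aligned_size(raw_bytes: int) -> int:
    """Round a raw disk size up to the next whole MiB."""
    return ((raw_bytes + MIB - 1) // MIB) * MIB


def expected_blob_size(raw_bytes: int) -> int:
    """Size of the fixed VHD file (and thus the Page Blob) after conversion."""
    return aligned_size(raw_bytes) + VHD_FOOTER_BYTES


def conversion_commands(raw_path: str, vhd_path: str, raw_bytes: int) -> List[str]:
    """Build (but do not run) the assumed qemu-img commands.

    Assumption: the captured image is available as a raw disk file.
    """
    size = aligned_size(raw_bytes)
    resize = ["qemu-img", "resize", "-f", "raw", raw_path, str(size)]
    convert = [
        "qemu-img", "convert", "-f", "raw", "-O", "vpc",
        "-o", "subformat=fixed,force_size", raw_path, vhd_path,
    ]
    return [shlex.join(resize), shlex.join(convert)]


if __name__ == "__main__":
    # Hypothetical file names and size, for illustration only.
    for cmd in conversion_commands("centos.raw", "centos.vhd", 8 * 1024 * MIB + 1):
        print(cmd)
```

Checking the file size against `expected_blob_size` before uploading is a cheap way to catch a dynamic (non-fixed) or misaligned VHD early.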
Using Packer
HySDS uses Packer to build images on AWS and OpenStack, so it was a natural step to use Packer to build our Azure images as well. Overall, Packer worked as expected, with a few small bugs that were easy to work around once understood. Unfortunately, finding the workarounds described below cost considerable time. During this period, Microsoft employees also required access to our images and account in order to help debug (see Security Notes and ITAR Issues). Thus, ITAR testing was abandoned.
If using Packer, you must use the image’s “description” field, not its “name” field, as the name of the base image; otherwise Packer will fail. In addition, make sure your image is configured for tty-less sudo access.
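As an illustration of the tty-less sudo requirement, the sketch below checks sudoers text for an active global `Defaults requiretty` line, which makes sudo refuse to run without a terminal. It assumes the image uses that standard sudoers directive; the function names are hypothetical, and per-user or per-host `Defaults` variants are not handled.

```python
import re

# Matches an active global "Defaults requiretty" line
# (not commented out and not already negated with "!").
_REQUIRETTY = re.compile(r"^[ \t]*Defaults[ \t]+requiretty\b", re.MULTILINE)


def blocks_ttyless_sudo(sudoers_text: str) -> bool:
    """Return True if the sudoers text forces sudo to require a TTY."""
    return bool(_REQUIRETTY.search(sudoers_text))


def disable_requiretty(sudoers_text: str) -> str:
    """Return sudoers text with active requiretty lines negated."""
    return _REQUIRETTY.sub("Defaults !requiretty", sudoers_text)


if __name__ == "__main__":
    sample = "Defaults    requiretty\nroot ALL=(ALL) ALL\n"
    print(blocks_ttyless_sudo(sample))       # True
    print(disable_requiretty(sample), end="")
```

Any edited sudoers file should be validated (for example with `visudo -c -f <file>`) before the image is captured, since a syntax error there can lock out sudo entirely.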
Starting Up Images
Starting up images was straightforward and worked well, with three caveats that again cost time to discover. First, the user must create a virtual network to hold the machines before instantiating any machines. Second, the user must use password access (see Security Notes). Third, “provisioning” of VMs from the Packer images never finishes, even though the provisioned VMs are ready to use.
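Because the reported provisioning state never completes, one possible way to decide when a VM is usable is to probe it directly instead of waiting on that state. The following is a minimal sketch under that assumption; it was not part of the original testing, and the function names, port, and timeouts are hypothetical.

```python
import socket
import time
from typing import Callable


def ssh_port_open(host: str, port: int = 22, connect_timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 600.0,
                     interval: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller might use `wait_until_ready(lambda: ssh_port_open("10.0.0.4"))`, where the address is a placeholder for the VM's address on the virtual network.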
Running HySDS
Running HySDS on the deployed images worked without issue. Several small HySDS improvements have been submitted to issue tracking.
Autoscaling
Due to the nature of Azure autoscaling and HySDS processing, autoscaling did not work for HySDS. Azure autoscaling works on a metric trigger, such as the percentage CPU load across the group, and starts up existing but suspended instances when that metric is reached.
HySDS, by contrast, needs to start up new instances based on internal queue sizes, and it is hard to guarantee that CPU loads stay high enough, consistently enough, to trigger reasonable thresholds. In addition, autoscaling is capped at 50 machines, each of which must be manually created and suspended in the correct group. This makes it nearly impossible to use autoscaling in a useful way for processing spikes.
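The mismatch between the two scaling models can be illustrated with a small sketch. This is not HySDS's actual scaling logic; the function names, the CPU threshold, and the jobs-per-worker ratio are hypothetical, and only the cap of 50 machines comes from the text above.

```python
import math
from typing import Sequence


def cpu_trigger_fires(cpu_percent_samples: Sequence[float],
                      threshold: float = 75.0) -> bool:
    """Metric-style trigger: fires when average CPU load exceeds the threshold."""
    if not cpu_percent_samples:
        return False
    return sum(cpu_percent_samples) / len(cpu_percent_samples) > threshold


def workers_for_queue(queued_jobs: int, jobs_per_worker: int = 4,
                      max_workers: int = 50) -> int:
    """Queue-style decision: size the fleet from the backlog, up to the cap."""
    if queued_jobs <= 0:
        return 0
    return min(max_workers, math.ceil(queued_jobs / jobs_per_worker))


if __name__ == "__main__":
    # Illustrative scenario: a large backlog while workers sit at modest CPU load.
    print(cpu_trigger_fires([30.0, 35.0, 25.0]))  # False: no scale-up is triggered
    print(workers_for_queue(400))                 # 50: the backlog calls for the full cap
```

In this made-up scenario the queue clearly calls for more workers, yet a CPU-based trigger never fires, which is the gap described above.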
In addition, Azure Storage cannot keep up with the load of ~50 machines writing to it. Storage often returns errors under even moderate write parallelism, so scalability problems would remain even if autoscaling were resolved (see Scalability below).
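A common general mitigation for transient errors under parallel writes is to retry with exponential backoff and jitter. The sketch below is generic and was not evaluated against Azure Storage in this test; the `operation` callable and all parameter values are hypothetical, and retrying would not remove an underlying throughput limit.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_backoff(operation: Callable[[], T], attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 30.0,
                 retry_on: Tuple[Type[BaseException], ...] = (OSError,),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation(), retrying on the given exceptions with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retries from many writers
    raise ValueError("attempts must be at least 1")
```

The random jitter matters when many machines write at once, since it keeps their retries from arriving in lockstep.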
Scalability
Azure has many problems with scalability. First, the storage system breaks down with fewer than 100 concurrently running ingesters. Second, there appears to be a global lock on creating instances, making home-grown automatic scaling impossible. Lastly, there is a maximum of 50 VMs per autoscaling group, preventing the user from using autoscaling to achieve scalability.
Security Notes
There appear to be a few critical security issues in Azure. First, SSH key access to the Azure machines we created does not work. The standard PEM file we used was reported to be in “invalid format”, not at upload time but as a start-up error. Following Azure's published instructions for SSH key access produces no error, but the key does not appear in the authorization file and access is impossible. Second, when falling back to SSH password access, Azure provisioning applies the password to the wrong user: the specified new user is not created, and the password is applied to whatever user already exists on the system. If these issues are not resolved, Azure cannot be used for non-trivial projects.
ITAR security issues also exist (see below).
ITAR Issues
In order to overcome many issues in Azure, Microsoft reps ask for access to the account, storage, and VMs. Because some of these reps are not US persons, granting this access immediately violates ITAR requirements. Given the reps assigned to the HySDS project, we had to abandon ITAR-sensitive testing in order to get the help needed to overcome these issues.
Miscellaneous Notes
Tooling Requirements
To use Azure effectively, the user needs all of the following tools, as no single tool is sufficient on its own.
Browser to use Portal
Azure CLI tools for Mac OS X or Linux
Windows with Hyper-V or another VM cluster (AWS, OpenStack, etc.)
Azure PowerShell tools for Windows
Upcoming Releases
The Microsoft reps repeatedly suggested that all problems will be resolved in the new Azure version, which is a complete paradigm change from the system we tested. The problems may be fixed, but given this shift we would be starting from zero, and these notes would no longer apply. Complete testing of the new system will be needed.
Scoring and Functionality Chart
A rough numerical scoring of Azure, and a functionality chart.
| Category | Score out of 100 |
|---|---|
| Setting up VMs | 20 |
| Provisioning VMs | 40 |
| Running VMs | 70 |
| Storage | 30 |
| Ease of Use | 10 |
| Scalability | 10 |
| Monitoring | 50 |
| Autoscaling | 20 |
| Security | 0 |
| ITAR Compliance (User perspective only) | 60 |
| Category | Does it Work |
|---|---|
| Importing Images and Starting VMs | X* |
| Running VMs | X |
| Monitoring VMs | X |
| Autoscaling | X** |
| Storage | X* |
| Large Scale Processing | |
* works for the most part but has some caveats or ease-of-use problems
** works as advertised but is not an effective tool for HySDS or similar projects
I would not recommend using Azure for new projects. The hidden costs of problematic security and poor ease of use will negate any direct savings from a lower-priced system.
Subject Matter Expert: @Hook Hua
Find an Error? Is this document outdated or inaccurate? Please contact the assigned Page Maintainer: @Hook Hua