Azure


Created by Hook Hua on Apr 11, 2016


Confidence Level TBD  This article has not been reviewed for accuracy, timeliness, or completeness. Check that this information is valid before acting on it.


 

HySDS on Azure Classic - Testing Impressions and Results 2015-11 - 2016-02

Author: Michael Starch

Test Project: Hybrid Cloud Science Data System


Intro

This document contains general thoughts and impressions on using the Azure service to run HySDS processing. The purpose is to expose not just the "does it work" results of the testing, but also usage notes and the trials encountered, so that upcoming projects can make informed decisions about using Azure for their cloud needs.



Getting Started

Setting up basic infrastructure on Azure turned out to be more complicated than expected, but may be considerably simplified if the project can run on the provided images (SUSE Linux, Windows). For HySDS, we were required to use an ITAR-compliant CentOS image, so we had to import our own image from the base cloud image available on centos.org. The process to import and start up an image is as follows:



  1. Download the cloud CentOS image from centos.org.

  2. Import the image into OpenStack, AWS, Hyper-V, or another virtualization service, then:

    1. Set up the basic image.

    2. Make the Azure-specific configuration changes.

    3. Install the Azure-specific client.

    4. Capture an image of the system.

  3. Convert the Azure-configured image to VHD format.

  4. Upload the VHD as a Page Blob, either by modifying the Azure Python code to force the Page Blob format, or (untested) by using the command-line tools to upload as a Page Blob.

  5. Start up the instance using password access only (see security notes).
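One detail behind steps 3 and 4 is alignment: Azure page blobs are addressed in 512-byte pages, so an uploaded VHD must be a whole multiple of 512 bytes in size (a fixed-format VHD can be produced with something like `qemu-img convert -O vpc -o subformat=fixed`). A minimal sketch of the padding arithmetic — the helper names here are illustrative, not part of any Azure SDK:

```python
PAGE = 512  # page blobs are read and written in 512-byte pages

def padded_size(length: int, align: int = PAGE) -> int:
    """Round a byte length up to the next multiple of align."""
    return (length + align - 1) // align * align

def padding_needed(length: int, align: int = PAGE) -> int:
    """Bytes of zero-padding to append so the file is page-aligned."""
    return padded_size(length, align) - length
```

For example, a 1000-byte file would be padded to 1024 bytes (24 bytes of padding) before upload.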



Using Packer

HySDS uses Packer to build images on AWS and OpenStack, so it was a natural step to use Packer to build images for Azure as well. Overall, Packer worked as expected, with a few small bugs that were easy to work around. Lamentably, these few bugs cost much time spent searching for the workarounds below. In addition, during this time Microsoft employees required access to our images and account in order to help debug (see the security and ITAR notes). Thus, ITAR testing was abandoned.



If using Packer, you must use the image's "description" field, not its "name" field, as the name of the base image; otherwise Packer will fail. In addition, make sure your image is configured for tty-less sudo access.
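The tty-less sudo requirement exists because Packer's shell provisioner runs sudo over a non-interactive SSH session, which CentOS blocks by default via `Defaults requiretty`. One way to satisfy it is a sudoers drop-in baked into the base image — a hedged sketch, where the user name "centos" is an assumption to be matched to your image's provisioning user:

```shell
# Sketch: write the sudoers drop-in to a scratch path; on the image itself
# it would be installed as /etc/sudoers.d/packer with mode 0440.
cat > /tmp/packer-sudoers <<'EOF'
Defaults:centos !requiretty
EOF
```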



Starting Up Images

Starting up images was straightforward and worked well, except for three caveats that again cost time to discover. First, the user must create a virtual network to hold the machines before instantiating any of them. Second, the user must use password access (see security notes). Third, "provisioning" of VMs built from the Packer images never finishes, even though the provisioned VMs are ready to use.



Running HySDS

Running HySDS on the deployed images worked without issue. Several small HySDS improvements have been submitted to issue tracking.



Autoscaling

Due to the nature of Azure autoscaling and HySDS processing, autoscaling did not work. Azure autoscaling is driven by a metric trigger, such as the percentage CPU load across the group, and when that trigger fires it starts existing but suspended instances.



HySDS, by contrast, needs to start new instances based on internal queue sizes, and it is hard to guarantee that CPU loads stay high enough, consistently enough, to trip a reasonable threshold. In addition, autoscaling is capped at 50 machines, each of which must be manually created and suspended in the correct group. This makes autoscaling nearly impossible to use for processing spikes.
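The scaling decision HySDS actually needs is a function of queue depth, not CPU load — something Azure's metric triggers cannot express. A sketch of that logic, with illustrative names and thresholds (this is not HySDS code):

```python
def target_workers(queued_jobs: int, jobs_per_worker: int = 2,
                   max_workers: int = 50) -> int:
    """Size the worker pool from queue depth.

    max_workers defaults to 50 to mirror Azure's per-group VM cap,
    which is exactly where this approach runs out of headroom.
    """
    if queued_jobs <= 0:
        return 0
    needed = -(-queued_jobs // jobs_per_worker)  # ceiling division
    return min(needed, max_workers)
```

A queue of 5 jobs at 2 jobs per worker asks for 3 workers; a spike of 1000 jobs is clipped to the 50-VM cap, illustrating why the cap defeats the purpose.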



In addition, Azure Storage cannot keep up with the load of ~50 machines writing to it: errors are frequently returned even at moderate write parallelism. Thus, even if the autoscaling issues were resolved, scalability problems would remain (see below).
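A common mitigation for these intermittent storage errors is to wrap each write in retries with exponential backoff and jitter. A hedged sketch — `upload` stands in for whatever callable performs the write; the real HySDS code path is not shown:

```python
import random
import time

def with_retries(upload, attempts: int = 5, base_delay: float = 0.5):
    """Call upload(), retrying on any exception with jittered backoff."""
    for attempt in range(attempts):
        try:
            return upload()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the storage error
            # sleep base_delay * 2^attempt, scaled by random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Backoff smooths over transient failures, but it only papers over the problem: at sustained high parallelism the error rate we observed would still dominate.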



Scalability

Azure has many scalability problems. First, the storage system breaks down at fewer than 100 concurrently running ingesters. Second, there appears to be a global lock on creating instances, which makes home-grown automatic scaling impractical. Lastly, the maximum of 50 VMs per autoscaling group prevents the user from using autoscaling to achieve scalability.



Security Notes

There appear to be a few critical security issues in Azure. SSH key access to the Azure machines we created does not work: the standard PEM file we used was reported to be in an "invalid format", not at upload time but as a start-up error. Following Azure's published instructions for SSH key access produces no error, but the key never appears in the authorized_keys file, so access is impossible. When falling back to an SSH password, Azure provisioning applies the password to the wrong user: the specified new user is not created, and the password is applied to whatever user already exists on the system. If these issues are not resolved, Azure cannot be used for non-trivial projects.
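Because the key-injection failure above is silent (no error, the key is simply missing), a post-provisioning sanity check is cheap insurance before locking down password access. A hedged Python sketch with illustrative paths — this is not HySDS or Azure tooling:

```python
import os

def key_installed(pubkey_line: str, authorized_keys_path: str) -> bool:
    """Return True if the exact public-key line is present in the file."""
    if not os.path.exists(authorized_keys_path):
        return False  # provisioning never created the file at all
    with open(authorized_keys_path) as f:
        return any(pubkey_line.strip() == line.strip() for line in f)
```

Run over e.g. `/home/<user>/.ssh/authorized_keys` right after the VM reports ready; a False result reproduces the failure mode described above.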



ITAR security issues also exist (see below).



ITAR Issues

In order to overcome many of the issues in Azure, Microsoft representatives asked for access to the account, storage, and VMs. Because some of these representatives are not US persons, this immediately violates ITAR requirements. Given the representatives assigned to the HySDS project, we had to abandon ITAR-sensitive testing in order to get the help needed to overcome all the issues.



Miscellaneous Notes

Tooling Requirements

To use Azure effectively, the user needs all of the following tools, as no single one is sufficient.

  1. Browser to use Portal

  2. Azure CLI tools for Mac OS X or Linux

  3. Windows with Hyper-V, or another virtualization service (AWS, OpenStack, etc.)

  4. Azure PowerShell tools for Windows



Upcoming Releases

The Microsoft representatives keep suggesting that all problems will be resolved in the new version of Azure, which is a complete paradigm change from the system we tested. Problems may well be fixed, but given the complete shift in paradigm we will be starting from zero and these notes will be invalidated. Complete testing of the new system will be needed.





Scoring and Functionality Chart

A rough numerical scoring of Azure, and a functionality chart.





| Category | Score (out of 100) |
| --- | --- |
| Setting up VMs | 20 |
| Provisioning VMs | 40 |
| Running VMs | 70 |
| Storage | 30 |
| Ease of Use | 10 |
| Scalability | 10 |
| Monitoring | 50 |
| Autoscaling | 20 |
| Security | 0 |
| ITAR Compliance (user perspective only) | 60 |











| Category | Does it Work? |
| --- | --- |
| Importing Images and Starting VMs | X* |
| Running VMs | X |
| Monitoring VMs | X |
| Autoscaling | X** |
| Storage | X* |
| Large Scale Processing | |





* works for the most part but has some caveats or ease-of-use problems

** works as advertised but is not an effective tool for HySDS or similar projects



I would not recommend using Azure for new projects. The hidden costs of its security and ease-of-use problems will negate any direct savings from a lower-priced system.

 


Related Articles:


Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community

JPLers can also ask HySDS questions at Stack Overflow Enterprise

Page Information:



Contribution History:

Subject Matter Expert:

@Hook Hua

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Page Maintainer:

@Hook Hua
