HySDS AWS Autoscaling

Highlights

Scale-up is triggered by a CloudWatch metric alarm on the job queue size.
We want to minimize Auto Scaling AZ rebalancing, because rebalancing terminates instances. It is therefore better to keep the fleet evenly balanced across Availability Zones (AZs).
When using a Spot Fleet, you can ensure that scale-out happens in multiples of the number of AZs.
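Rounding a scale-out request up to a multiple of the AZ count is simple arithmetic. A minimal sketch, assuming the fleet spans `num_azs` Availability Zones; `az_balanced_increment` is a hypothetical helper, not part of HySDS or any AWS API:

```python
def az_balanced_increment(requested, num_azs):
    """Round a scale-out request up to the next multiple of num_azs so that
    every Availability Zone receives the same number of new instances.
    Hypothetical helper for illustration only."""
    if num_azs <= 0:
        raise ValueError("num_azs must be positive")
    if requested <= 0:
        return 0
    # Integer ceiling division, then scale back up to a multiple of num_azs.
    return -(-requested // num_azs) * num_azs


# Example: a request for 10 instances across 3 AZs becomes 12 (4 per AZ).
print(az_balanced_increment(10, 3))
```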
Set a scaling policy per ASG, for example:
- If the queue-size metric breaches the alarm threshold (1) for more than 60 seconds:
  - Add 1 instance when JobsWaiting-grfn-job_worker-large is in [1,10)
  - Add 10 instances when JobsWaiting-grfn-job_worker-large is in [10,∞)
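A sketch of this policy using boto3 (the current AWS SDK for Python, not the legacy boto library referenced below). AWS expresses step bounds as offsets from the alarm threshold, so with a threshold of 1 the range [1,10) becomes [0,9) and [10,∞) becomes [9,∞). Assumptions: the metric lives in a CloudWatch namespace called `HySDS`, the alarm uses the Average statistic with `GreaterThanOrEqualToThreshold` (so a queue size of exactly 1 triggers the first step), and the policy and alarm names are hypothetical. The clients are passed in so the functions can be exercised without AWS credentials.

```python
QUEUE_THRESHOLD = 1.0  # alarm threshold from the example policy


def step_adjustments(threshold, steps):
    """Convert absolute metric ranges into boto3 StepAdjustments.

    For a metric above the threshold, AWS treats the lower bound as inclusive
    and the upper bound as exclusive. `steps` is a list of
    (lower, upper, instances_to_add) in absolute metric units; upper=None
    means "no upper bound".
    """
    adjustments = []
    for lower, upper, add in steps:
        adj = {"MetricIntervalLowerBound": float(lower - threshold),
               "ScalingAdjustment": add}
        if upper is not None:
            adj["MetricIntervalUpperBound"] = float(upper - threshold)
        adjustments.append(adj)
    return adjustments


def create_queue_scale_up(autoscaling, cloudwatch, asg_name, metric_name,
                          namespace="HySDS"):  # namespace is an assumption
    """Attach a step scale-up policy to the ASG and the alarm that fires it.

    `autoscaling` and `cloudwatch` are boto3 clients (or test doubles).
    Returns the ARN of the new scaling policy.
    """
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName=f"{asg_name}-queue-scale-up",  # hypothetical naming scheme
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        MetricAggregationType="Average",
        StepAdjustments=step_adjustments(QUEUE_THRESHOLD, [
            (1, 10, 1),      # queue size in [1, 10): add 1 instance
            (10, None, 10),  # queue size in [10, inf): add 10 instances
        ]),
    )
    # Fire when the queue metric is >= 1 for one 60-second period.
    cloudwatch.put_metric_alarm(
        AlarmName=f"{metric_name}-scale-up",  # hypothetical naming scheme
        Namespace=namespace,
        MetricName=metric_name,
        Statistic="Average",
        Period=60,
        EvaluationPeriods=1,
        Threshold=QUEUE_THRESHOLD,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )
    return policy["PolicyARN"]
```

With real clients this would be called as `create_queue_scale_up(boto3.client("autoscaling"), boto3.client("cloudwatch"), "<asg-name>", "JobsWaiting-grfn-job_worker-large")`.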

Optimization

Auto-scaling optimizations
- Current public batch size: 20 instances per 5-minute cool-down.
- Default cool-down: 300 seconds.
- Default internal batch rate: 10 instances per 30 seconds.
- AWS will raise our ASG maximum batch rate to 100 instances per 30 seconds.
- We could manually set the desired group size to 100.
- Logs indicate that our queue_size metric alarm only fires every 10 minutes.
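The batch rates above bound how fast the group can grow. A back-of-the-envelope sketch for 1000 instances; it ignores alarm evaluation delay, the 10-minute alarm interval, and instance boot time, so its numbers are lower bounds rather than a reproduction of the 55-minute estimate under Recommendations. `ramp_up_minutes` is a hypothetical helper.

```python
def ramp_up_minutes(target, batch_size, interval_seconds):
    """Lower bound on minutes to launch `target` instances when the ASG adds
    at most `batch_size` instances every `interval_seconds`."""
    batches = -(-target // batch_size)  # integer ceiling division
    return batches * interval_seconds / 60


RATES = [
    ("public batch: 20 per 300 s cool-down", 20, 300),
    ("default internal rate: 10 per 30 s", 10, 30),
    ("raised internal rate: 100 per 30 s", 100, 30),
]

for label, batch, interval in RATES:
    minutes = ramp_up_minutes(1000, batch, interval)
    print(f"{label}: {minutes:.0f} min for 1000 instances")
```

This prints 250, 50, and 5 minutes respectively, which shows why the default internal rate, not just the cool-down, dominates large ramp-ups.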
Recommendations
- Change the cool-down to 1 minute.
- Try a batch size of 100 instances.
- Publish the custom queue_size metric to CloudWatch and have it checked every 1 minute.
If we make these recommended changes, we estimate that ramping up to 1000 instances will take 55 minutes.
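A sketch of the cool-down and metric-frequency recommendations with boto3. Assumptions: the `HySDS` namespace, the `Count` unit, and `get_queue_size`, a hypothetical callable that reads the job queue depth. According to the EC2 Auto Scaling documentation, the ASG default cool-down applies to simple scaling policies, while step scaling policies rely on instance warm-up instead, so check which policy type the ASG uses before relying on the cool-down change.

```python
import time


def apply_cooldown(autoscaling, asg_name, seconds=60):
    """Shorten the ASG default cool-down (AWS default: 300 s) to `seconds`."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name, DefaultCooldown=seconds)


def publish_queue_size(cloudwatch, metric_name, queue_size,
                       namespace="HySDS"):  # namespace is an assumption
    """Send one datapoint of the custom queue-size metric to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": metric_name,
            "Value": float(queue_size),
            "Unit": "Count",
        }],
    )


def publish_loop(cloudwatch, metric_name, get_queue_size, iterations,
                 interval_seconds=60, sleep=time.sleep):
    """Publish the queue size once per interval so that a 60-second alarm
    period always has a fresh datapoint. `get_queue_size` is a hypothetical
    callable supplied by the caller; `sleep` is injectable for testing."""
    for _ in range(iterations):
        publish_queue_size(cloudwatch, metric_name, get_queue_size())
        sleep(interval_seconds)
```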

Self-Termination

Suspend AZ rebalancing on the auto-scaling group so that scale-down by worker self-termination does not trigger rebalancing:
http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SuspendResume.html
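Command to turn off AZ rebalancing for an ASG: aws autoscaling suspend-processes --auto-scaling-group-name <asg-name> --scaling-processes AZRebalance
This only needs to be run once per ASG, and the suspended process will show up in the Details tab of the ASG.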
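Turning off AZ rebalancing can also be done from Python with boto3, along with a check of the group's `SuspendedProcesses` (the information the ASG's Details tab displays). The `self_terminate` helper is an illustrative sketch of how a worker could remove itself while decrementing desired capacity, so the ASG does not launch a replacement; this page does not describe HySDS's actual self-termination mechanism, and how the worker learns its instance ID is left out.

```python
def suspend_az_rebalance(autoscaling, asg_name):
    """Suspend only the AZRebalance process; the other scaling processes
    (launch, terminate, alarm notifications, ...) keep running."""
    autoscaling.suspend_processes(
        AutoScalingGroupName=asg_name, ScalingProcesses=["AZRebalance"])


def az_rebalance_suspended(autoscaling, asg_name):
    """Return True if AZRebalance is listed in the ASG's SuspendedProcesses."""
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])
    groups = resp["AutoScalingGroups"]
    if not groups:
        raise LookupError(f"Auto Scaling group not found: {asg_name}")
    suspended = groups[0].get("SuspendedProcesses", [])
    return any(p["ProcessName"] == "AZRebalance" for p in suspended)


def self_terminate(autoscaling, instance_id):
    """Illustrative sketch: terminate this worker and shrink the group's
    desired capacity so the ASG does not replace it."""
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True)
```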

References

AWS Autoscaling

http://boto.readthedocs.org/en/latest/autoscale_tut.html

AWS CloudWatch

boto: Python interface to Amazon Web Services

>>> import boto.ec2.cloudwatch
>>> c = boto.ec2.cloudwatch.connect_to_region('us-west-2')
>>> metrics = c.list_metrics()
>>> metrics
[Metric:DiskReadBytes,
 Metric:CPUUtilization,
 Metric:DiskWriteOps,
 Metric:DiskReadOps,
 Metric:DiskReadBytes,
 Metric:DiskReadOps,
 Metric:CPUUtilization,
 Metric:DiskWriteOps,
 Metric:NetworkIn,
 Metric:NetworkOut,
 Metric:NetworkIn,
 Metric:DiskReadBytes,
 Metric:DiskWriteBytes,
 Metric:NetworkIn,
 Metric:NetworkOut,
 Metric:DiskReadOps,
 Metric:CPUUtilization,
 Metric:DiskReadOps,
 Metric:CPUUtilization,
 Metric:DiskWriteBytes,
 Metric:DiskReadBytes,
 Metric:NetworkOut,
 Metric:DiskWriteOps]
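The session above uses the legacy boto 2 library. A boto3 equivalent is sketched below, assuming the metrics of interest are in the `AWS/EC2` namespace; it follows pagination and counts names. Names repeat in the listing because CloudWatch treats each combination of metric name and dimensions (for example, one per instance) as a separate metric.

```python
from collections import Counter


def list_metric_names(cloudwatch, namespace="AWS/EC2"):
    """Return every metric name in `namespace`, following pagination."""
    names = []
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=namespace):
        names.extend(m["MetricName"] for m in page["Metrics"])
    return names


def metric_name_counts(cloudwatch, namespace="AWS/EC2"):
    """Count how many metrics (name plus dimensions) share each name."""
    return Counter(list_metric_names(cloudwatch, namespace))
```

With a real client: `metric_name_counts(boto3.client("cloudwatch", region_name="us-west-2"))`.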

Code

http://gitlab:8000/browser/trunk/HySDS/cluster_fab/aria-jobs-dev/test_autoscale.py

Related Articles:

- 2018-01-31 bulk reprocessing 50Gbps in trinity mode
- Amazon Web Services Guide
- CloudWatch custom metrics
- Create Auto-Scaling Fleet Queue
- Create AWS Autoscaling Group for Verdi
- How to configure a cluster to use a docker registry
- How to debug on an ASG worker node
- How to manually ingest a dataset into GRQ/S3
- HySDS AWS Autoscaling
- HySDS GUI's Overview
- Manipulation of AWS EC2 Instances, S3 Buckets, Autoscaling Groups (ASG)
- Notes on installation

Have Questions? Ask a HySDS Developer:

Anyone can join our public Slack channel to learn more about HySDS. JPL employees can join #HySDS-Community.

JPLers can also ask HySDS questions at Stack Overflow Enterprise

🚀 Page Information:

Contribution History:

Topher Allen (1551 days ago)
Kate Sammons (1842 days ago)

Find an Error?

Is this document outdated or inaccurate? Please contact the assigned Subject Matter Expert:

Hook Hua
