...
- scale-up is triggered by an auto-scaling metric alarm that checks the queue size.
- We want to reduce auto-scaling AZ rebalancing, which results in terminations. It is therefore better to keep the fleet balanced evenly across availability zones (AZs).
- when using a spot fleet, scale out in multiples of the number of AZs so the fleet stays balanced.
- Set scaling policy per ASG
- example: if the alarm metric stays above a threshold of 1 for more than 60 seconds
- Add 1 instance when JobsWaiting-grfn-job_worker-large is [1,10)
- Add 10 instances when JobsWaiting-grfn-job_worker-large is [10,∞)
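The step adjustments above can be sketched as a small lookup, together with the rounding needed to scale out in multiples of the AZ count. This is only an illustration, not the actual policy definition: the function names and the `num_azs` parameter are hypothetical, and only the step bounds come from the example.

```python
import math

# Step bounds mirror the example policy: [1, 10) adds 1, [10, inf) adds 10.
# Each entry is (inclusive lower bound, exclusive upper bound, instances to add).
STEP_ADJUSTMENTS = [
    (1, 10, 1),
    (10, math.inf, 10),
]


def instances_to_add(jobs_waiting, steps=STEP_ADJUSTMENTS):
    """Return the adjustment for the step containing jobs_waiting, or 0."""
    for lower, upper, adjustment in steps:
        if lower <= jobs_waiting < upper:
            return adjustment
    return 0


def round_up_to_az_multiple(count, num_azs):
    """Round count up to the next multiple of num_azs (hypothetical helper)."""
    if count <= 0:
        return 0
    return int(math.ceil(count / float(num_azs))) * num_azs
```

With three AZs, for example, a step that adds 10 instances would be rounded up to 12 so that each zone receives the same number.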
Optimization
- auto-scaling optimizations
- current public batch size: 20 instances per 5-minute cool-down
- default cool-down: 300 seconds
- default internal batch rate: 10 instances per 30 seconds
- AWS ASG will increase our max batch rate to 100 instances per 30 seconds
- could manually set the desired group size to 100
- logs indicate our queue_size metric alarm is only firing every 10 minutes.
- recommendations
- change the cool-down to 1 minute
- try a batch size of 100 instances
- set the custom queue_size CloudWatch metric to be checked every 1 minute
- with these recommended changes, we estimate that 1000 instances will take 55 minutes to ramp up
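For the one-minute queue_size recommendation, the metric has to be published at least that often. A minimal sketch with boto 2 follows, assuming a CloudWatch connection from `boto.ec2.cloudwatch.connect_to_region` and its `put_metric_data` method; the `Custom/JobQueue` namespace and the function name are hypothetical, and how the queue size is measured is left to the caller.

```python
def publish_queue_size(conn, queue_size, namespace="Custom/JobQueue",
                       metric_name="queue_size"):
    """Publish one queue-size data point to CloudWatch.

    conn is assumed to be a boto 2 CloudWatch connection, e.g.
        import boto.ec2.cloudwatch
        conn = boto.ec2.cloudwatch.connect_to_region('us-west-2')
    The namespace default is a placeholder, not taken from the notes.
    """
    return conn.put_metric_data(
        namespace=namespace,
        name=metric_name,
        value=queue_size,
        unit="Count",
    )
```

Calling this once a minute (from cron or a worker loop) gives a one-minute alarm period something to evaluate; the alarm period itself still has to be set to 60 seconds separately.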
self-termination
- need to suspend the auto-scaling group's AZ rebalancing (the AZRebalance process) for scale-down
- http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SuspendResume.html
- command for ASG to turn off AZ rebalancing
aws autoscaling suspend-processes --auto-scaling-group-name ${yourASGname} --scaling-processes AZRebalance
- This only needs to be run once per ASG and will show up in the details tab of the ASG
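The same suspension can be done from Python. The sketch below assumes a boto 2 auto-scaling connection from `boto.ec2.autoscale.connect_to_region` and its `suspend_processes` / `resume_processes` methods; the wrapper function names are hypothetical.

```python
AZ_REBALANCE = "AZRebalance"  # process name used by the CLI command above


def suspend_az_rebalance(conn, asg_name):
    """Suspend only the AZRebalance process on the named ASG.

    conn is assumed to be a boto 2 auto-scaling connection, e.g.
        import boto.ec2.autoscale
        conn = boto.ec2.autoscale.connect_to_region('us-west-2')
    """
    return conn.suspend_processes(asg_name, scaling_processes=[AZ_REBALANCE])


def resume_az_rebalance(conn, asg_name):
    """Re-enable AZRebalance if it needs to be turned back on later."""
    return conn.resume_processes(asg_name, scaling_processes=[AZ_REBALANCE])
```

Passing the process list explicitly matters: if no processes are named, the suspend call applies to all scaling processes on the group.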
References
AWS Autoscaling
AWS CloudWatch
- http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatch.html
- http://docs.pythonboto.org/en/latest/cloudwatch_tut.html
boto: Python interface to Amazon Web Services
- https://github.com/boto/boto
- http://docs.pythonboto.org/en/latest/
- http://docs.pythonboto.org/en/latest/autoscale_tut.html
>>> import boto.ec2.cloudwatch
>>> c = boto.ec2.cloudwatch.connect_to_region('us-west-2')
>>> metrics = c.list_metrics()
>>> metrics
[Metric:DiskReadBytes, Metric:CPUUtilization, Metric:DiskWriteOps, Metric:DiskWriteOps, Metric:DiskReadOps, Metric:DiskReadBytes, Metric:DiskReadOps, Metric:CPUUtilization, Metric:DiskWriteOps, Metric:NetworkIn, Metric:NetworkOut, Metric:NetworkIn, Metric:DiskReadBytes, Metric:DiskWriteBytes, Metric:DiskWriteBytes, Metric:NetworkIn, Metric:NetworkIn, Metric:NetworkOut, Metric:NetworkOut, Metric:DiskReadOps, Metric:CPUUtilization, Metric:DiskReadOps, Metric:CPUUtilization, Metric:DiskWriteBytes, Metric:DiskWriteBytes, Metric:DiskReadBytes, Metric:NetworkOut, Metric:DiskWriteOps]