
Commit

Documentation updates
- split the huge README into multiple files
- delete obsolete HTML from docs
cristim committed Feb 21, 2017
1 parent 16b53ad commit b4938f0
Showing 9 changed files with 332 additions and 1,355 deletions.
331 changes: 52 additions & 279 deletions README.md

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions SETUP.md
@@ -1,11 +1,11 @@
# AutoSpotting Setup #

-It's relatively easy to build and install your own version of this tool's
-binaries, removing your dependency on the author's version, and allowing any
-customizations and improvements your organization needs. You'll need to set up a
-local environment to run Go, compile the binaries locally, upload them to an S3
-bucket in your AWS account, and update the CloudFormation stack to use those new
-binaries.
+It's usually recommended to use the provided binaries, but in some cases you may
+need to customize AutoSpotting for your own environment.
+
+You'll need to set up a local environment able to compile Go code, compile the
+binaries locally, upload them to an S3 bucket in your AWS account and update
+your CloudFormation stack to use those new binaries.

## Dependencies ##

274 changes: 274 additions & 0 deletions TECHNICAL_DETAILS.md
@@ -0,0 +1,274 @@
# Technical Details #

## Features and Benefits ##

- **Significant cost savings compared to on-demand or reserved instances**
- up to 90% cost reduction compared to on-demand instances.
- up to 75% cost reduction compared to reserved instances, without any down
payment or long-term commitment.

- **Easy to install and set up on existing environments based on AutoScaling**
- you can literally get started within minutes.
- only needs to be installed once, in a single region, and can handle all
other regions without any additional configuration (but can also be
restricted to just a few regions if desired).
- easy to enable and disable based on resource tagging, so you can revert to
the initial configuration if you decide you no longer want to use it.
- easy to automate migration of multiple existing stacks, simply using scripts
that set the expected tags on multiple AutoScaling groups (see the tagging
example after this list).

- **Designed for use against AutoScaling groups with relatively long-running
instances**
- for use cases where it is acceptable to run on-demand instances from time to
time.
- for short-term batch processing use cases you should look into [spot
blocks](https://aws.amazon.com/blogs/aws/new-ec2-spot-blocks-for-defined-duration-workloads/)
instead.

- **It doesn't interfere with the group's original launch configuration**
- any instance replacement or scaling done by AutoScaling would still launch
your previously configured on-demand instances.
- on-demand instances often launch faster than spot ones so you don't need to
wait for potentially slower spot instance fulfilment when you need to scale
out or when you eventually lose some of the spot capacity.

- **Supports any higher level AWS services internally backed
by AutoScaling**
- services such as ECS or Elastic Beanstalk work out of the box with minimal
configuration changes or tweaks.

- **Compatible out of the box with most AWS services that integrate
with AutoScaling groups**
- services such as ELB, ALB, CodeDeploy, CloudWatch, etc. should work out of
the box or at most require minimal configuration changes.
- as long as they support instances attached later to existing groups.
- any other 3rd party services that run on top of AutoScaling groups should
work as well.

- **Can automatically replace any instance types with any instance types
available on the spot market**
- as long as they are cheaper and at least as big as the original instances.
- it doesn't matter whether the original instance type is even available on the
spot market: for example it often replaces t2.medium with better m4.large
instances, as long as those happen to be cheaper.

- **Self-hosted**
- has no runtime dependencies on external infrastructure except for the
regional EC2 and AutoScaling API endpoints.
- it's not a SaaS, it fully runs within your AWS account.
- it doesn't gather/persist/export any information about the resources running
in your AWS account.

- **Free and open source**
- there are no service fees at install time or run time.
- you only pay for the small runtime costs it generates.
- open source, so it is fully auditable and you can see the logs of everything
it does.
- the code is relatively small and simple so in case of bugs or missing
features you may even be able to fix it yourself.

- **Negligible runtime costs**
- you only pay for the bandwidth consumed performing API calls against AWS
services across different regions.
- backed by Lambda, with typical monthly execution time well within the Lambda
free tier plan.

- **Minimalist and simple implementation**
- currently about 1000 lines of relatively readable Golang code.
- stateless, and without many moving parts.
- leveraging and relying on battle-tested AWS services - namely AutoScaling -
for most mission-critical things, such as instance health checks, horizontal
scaling, replacement of terminated instances, and integration with ELB, ALB
and CloudWatch.

- **Relatively safe and secure**
- most runtime failures or crashes (quite rare nowadays) tend to be harmless.
- they often only result in failing to start new spot instances, so your group
will simply remain on or fall back to on-demand capacity, just as it was
before.
- in most cases it impacts neither your running instances nor the ability to
launch new ones.
- only needs the minimum set of IAM permissions needed for it to do its job.
- does not delegate any IAM permissions to resources outside of your AWS
account.
- execution scope can be limited to a certain set of regions.

- **Optimizes for high availability over cost whenever possible**
- it tries to diversify the instance types in order to reduce the chance of
simultaneous failures across the entire group. When the group has enough
desired capacity, it often spreads the instances over four different spot
pricing zones (instance type/availability zone combinations).
- supports keeping a configurable number of on-demand instances in the group,
either an absolute number or a percentage of the instances from the group.
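
A minimal example of enabling it on a (hypothetical) group named `my-asg`,
using the `spot-enabled` tag convention of AutoSpotting's default
configuration:

```bash
# Enable AutoSpotting on an existing AutoScaling group by tagging it.
# "my-asg" is a placeholder; "spot-enabled" is the tag key assumed by the
# default configuration.
aws autoscaling create-or-update-tags --tags \
  ResourceId=my-asg,ResourceType=auto-scaling-group,Key=spot-enabled,Value=true,PropagateAtLaunch=false
```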

## Replacement logic ##

Once enabled on an AutoScaling group, it gradually replaces all the on-demand
instances belonging to the group with compatible, similarly configured but
cheaper spot instances.

The replacements are done using the relatively new Attach/Detach actions
supported by the AutoScaling API. A new compatible spot instance is launched,
and after a while (at least as long as the group's grace period) it is attached
to the group, while at the same time an on-demand instance is detached from the
group and terminated, in order to keep the group at constant capacity.

When assessing the compatibility, it takes into account the hardware specs, such
as CPU cores, RAM size, attached instance store volumes and their type and size,
as well as the supported virtualization types (HVM or PV) of both instance
types. The new spot instance is usually a few times cheaper than the original
instance, while also often providing more computing capacity.

The new spot instance is configured with the same roles, security groups and
tags and set to execute the same user data script as the original instance, so
from a functionality perspective it should be indistinguishable from other
instances in the group, although its hardware specs may be slightly different
(again: at least as big, but often of bigger capacity).

When replacing multiple instances in a group, the algorithm tries to use a wide
variety of instance types, in order to reduce the probability of simultaneous
failures that may impact the availability of the entire group. It always tries
to launch the cheapest available compatible instance type, but if the group
already has a considerable number of instances of that type in the same
availability zone (currently more than 20% of the group's capacity is in that
zone and of that instance type), it picks the second cheapest compatible
instance type, and so on, as shown in the sketch below.
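
A simplified Go sketch of this diversification logic; the types, prices and
data here are illustrative, not AutoSpotting's actual internals:

```go
package main

import (
	"fmt"
	"sort"
)

// spotPool is a spot pricing zone: an instance type in an availability zone.
type spotPool struct {
	instanceType string
	az           string
	hourlyPrice  float64
}

// pickPool returns the cheapest compatible pool, skipping pools that already
// hold more than 20% of the group's desired capacity.
func pickPool(pools []spotPool, countPerPool map[string]int, desired int) *spotPool {
	sort.Slice(pools, func(i, j int) bool {
		return pools[i].hourlyPrice < pools[j].hourlyPrice
	})
	for i := range pools {
		key := pools[i].instanceType + "/" + pools[i].az
		if float64(countPerPool[key]) <= 0.2*float64(desired) {
			return &pools[i]
		}
	}
	return nil // every pool is already above the concentration cap
}

func main() {
	pools := []spotPool{
		{"m4.large", "us-east-1a", 0.031},
		{"m4.large", "us-east-1b", 0.029},
		{"c4.large", "us-east-1a", 0.027},
	}
	// c4.large/us-east-1a is cheapest but already too concentrated,
	// so the second cheapest pool gets picked instead.
	counts := map[string]int{"c4.large/us-east-1a": 3}
	if p := pickPool(pools, counts, 10); p != nil {
		fmt.Printf("launching %s in %s\n", p.instanceType, p.az)
	}
}
```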

During multiple replacements performed on a given group, it only swaps one
instance at a time per Lambda function invocation, in order not to change the
group too fast, but instances belonging to multiple groups can be replaced
concurrently. If you find this slow, the Lambda function invocation frequency
(defaulting to once every 5 minutes) can be changed by updating the
CloudFormation stack, which has a parameter for it.
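
A sketch of such an update from the CLI, assuming the stack is named
AutoSpotting and exposes the frequency as a parameter (the parameter name
below is an assumption; check your stack's actual parameter list):

```bash
# Reuse the current template and only override the execution frequency.
# "ExecutionFrequency" is an assumed parameter name. The stack creates
# IAM resources, hence the capability flag.
aws cloudformation update-stack \
  --stack-name AutoSpotting \
  --use-previous-template \
  --parameters ParameterKey=ExecutionFrequency,ParameterValue="rate(1 minute)" \
  --capabilities CAPABILITY_IAM
```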

In the (so far unlikely) case in which the spot market price is high enough
that no compatible spot instances can be launched (and also in case of software
crashes, which may still rarely happen), the group is not changed and keeps
running as it is, but AutoSpotting will keep trying to replace the instances,
until the prices eventually decrease again and the replacements can succeed.

## Internal components ##

When deployed, the software consists of a number of resources running in your
Amazon AWS account, created automatically with CloudFormation:

### Event generator ###

CloudWatch event source used for triggering the Lambda function. The default
frequency is every 5 minutes, but it is configurable using CloudFormation.

### Lambda function ###

- AWS Lambda function connected to the event generator, which triggers it
periodically.
- It is assigned an IAM role and policy with a set of permissions to call the
APIs of various AWS services (EC2 and AutoScaling for now) within the user's
account.
- The permissions are the minimal set required for it to work without the need
of passing any explicit AWS credentials or access keys.
- Some algorithm parameters can be configured using Lambda environment
variables, based on some of the CloudFormation stack parameters.
- Contains a handler written in Golang, built using the
[eawsy/aws-lambda-go](https://github.com/eawsy/aws-lambda-go) library, which
implements a novel approach that allows natively compiled Golang code to be
injected into the Lambda Python runtime.
- The handler implements all the instance replacement logic.
- The spot instances are created by duplicating the configuration of the
currently running on-demand instances as closely as possible (IAM roles,
security groups, user_data script, etc.), only adding a spot bid price
attribute and possibly changing the instance type to a usually bigger, but
compatible one.
- The bid price is set to the on-demand price of the instances configured
initially on the AutoScaling group.
- The new launch configuration may also have a different instance type,
determined based on compatibility with the original instance type,
considering also how much redundancy we need to have in place in the current
availability zone, in order to survive instance termination when outbid for
a certain instance type.

## Running example ##

![Workflow](https://cdn.cloudprowess.com/images/autospotting.gif)

In this case the initial instance type was quite expensive, so the algorithm
chose a different type that had more computing capacity. In the end the group
had 3x more CPU cores and 66% more RAM than in its initial state, and all this
with 33% cost savings and without running entirely on spot instances, since
some users find that a bit risky.

Nevertheless, AutoSpotting tends to be quite reliable even in all-spot
configurations (it has automated failover to on-demand nodes and spreads over
multiple price zones), where it can often achieve savings of up to 90% off the
usual on-demand prices, much like the 85% price reduction shown below, seen on
a group of two m3.medium instances running in eu-west-1:

![Savings Graph](https://cdn.cloudprowess.com/images/autospotting-savings.png)

## Best Practices ##

These recommendations apply for most cloud environments, but they become
especially important when using more volatile spot instances.

- **Set a non-zero grace period on the AutoScaling group**
- so that spot instances are attached only after they are fully configured.
- otherwise they may be attached prematurely, before being ready to serve
traffic.
- they may also be terminated after failing load balancer health checks (see
the example command after this list).

- **Check your instance storage and block device mapping configuration**
- this may become an issue if you use instances which have ephemeral instance
storage, often the case on previous generation instance types.
- you should only specify ephemeral instance store in the on-demand launch
configuration if you do make use of it by mounting it on the filesystem.
- the replacement algorithm tries to give you instances with as much instance
storage as your original instances, since it can't tell if you did mount it.
- this adds more constraints on the algorithm, so it reduces the number of
compatible instance types it can use for launching spot instances.
- this is fine if you actually use that instance storage, but it reduces your
options if you don't, so the algorithm may more often fail to get spot
instances and fall back to on-demand capacity.

- **Don't keep state on instances**
- You should delegate all your state to external services; AWS has a wide
offering of stateful services which allow your instances to become
stateless:
- Databases: RDS, DynamoDB
- Caches: ElastiCache
- Storage: S3, EFS
- Queues: SQS
- Don't attach EBS volumes to individual instances; try to use EFS instead.

- **Handle the spot instance termination signal**
- See the next section for more detailed instructions.
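
A possible way of setting a non-zero grace period on an existing
(hypothetical) group from the CLI, as referenced in the first recommendation
above:

```bash
# Give new instances 5 minutes to configure themselves before their
# load balancer health checks start counting.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-grace-period 300
```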

## Spot termination notifications ##

AWS
[notifies](https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/)
your spot instances when they are about to be terminated by setting a dedicated
metadata field, so you can use that information to save whatever temporary
state you may still have on your running spot instances or to gracefully remove
them from the group.
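
For example, the notice can be polled from within the instance; the metadata
URL below returns a 404 error until a termination is actually scheduled:

```bash
# Prints the planned termination time roughly two minutes in advance,
# and fails with a 404 status until then.
curl -s http://169.254.169.254/latest/meta-data/spot/termination-time
```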

There are existing third party tools which implement such a termination
notification handler, such as [seespot](https://github.com/acksin/seespot).
This will need to be integrated into your user_data script; for more details,
see the seespot tool's documentation.

### Instances behind an ELB ###

Instances behind an ELB can be gracefully
[removed](https://aws.amazon.com/blogs/aws/elb-connection-draining-remove-instances-from-service-with-care/)
from the load balancer without losing connections. You should enable the
connection draining feature, then append a snippet along the lines of the one
below to your user_data script, assuming your instances have enough IAM role
permissions to remove themselves from the load balancer.
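
A minimal sketch of such a user_data addition, assuming a load balancer named
my-load-balancer (a placeholder) and the AWS CLI available on the instance:

```bash
# Watch for the spot termination notice in the background, then deregister
# this instance from the ELB so connection draining can complete.
(
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
  # The termination-time URL returns 404 until termination is scheduled.
  while ! curl -sf -o /dev/null \
      http://169.254.169.254/latest/meta-data/spot/termination-time; do
    sleep 5
  done
  aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-load-balancer \
    --instances "$INSTANCE_ID" \
    --region "${AZ%?}"   # strip the zone letter to get the region
) &
```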

### ECS container hosts ###

The container hosts can now be
[drained](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-draining.html)
in a similar way, migrating all the Docker containers to the other hosts in
your cluster before the spot instance is terminated. This blog
[post](https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/)
explains it in great detail, until AWS hopefully implements this out of the
box.
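
For example, a termination notice handler on the host could mark the instance
as draining with a command along these lines (the cluster name is a
placeholder; the container instance ARN comes from the ECS agent's local
introspection API):

```bash
# Ask ECS to drain this container instance, moving its tasks elsewhere.
ARN=$(curl -s http://localhost:51678/v1/metadata \
  | python -c 'import json,sys; print(json.load(sys.stdin)["ContainerInstanceArn"])')
aws ecs update-container-instances-state \
  --cluster my-cluster \
  --container-instances "$ARN" \
  --status DRAINING
```
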
1 change: 0 additions & 1 deletion docs/CNAME

This file was deleted.
