These are reference scripts to demonstrate how to do gradual deployments using AWS Step Functions versions and aliases.
You can use these scripts as inspiration to provision your own gradual deployments in your CI/CD environments of choice.
The Python example shows how to use an AWS SDK to manage a gradual deployment, whereas the Bash script shows which AWS CLI commands you can use if you prefer. An alternative is to use CloudFormation for Step Functions Gradual Deployments.
Since this is a Python script, you need the Python 3 runtime.
To run `sfndeploy.py`, you will need to install `boto3` and configure your AWS credentials.
tl;dr: `pip install boto3`
sfndeploy.py is a Python 3 script showing how to use the boto3 AWS SDK for Python to create gradual deployments with Step Functions.
This script demonstrates the following deployment strategies:
- Canary - route a small percentage of traffic to the new version initially, then after a validation period where no alarms trigger, switch 100% to that new version.
- Linear (aka Rolling) - route an increasing percentage of traffic to the new version, stepping up from 0% to 100% over time and rolling back immediately if any alarms trigger.
- All at Once (aka Blue/Green) - immediately switch 100% of traffic to the new version, monitor the new version, and roll back automatically to the previous version if any alarms trigger.
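All three strategies ultimately work by adjusting the weights in the alias's routing configuration. Here is a minimal boto3 sketch of that underlying call; the ARNs are placeholders, and `sfndeploy.py` wraps this same API with the strategy logic described above.

```python
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARNs - substitute your own state machine version and alias ARNs.
old_version_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:1"
new_version_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:2"
alias_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:my-alias"

# Route 5% of new executions to the new version and keep 95% on the old one.
# The weights in the routing configuration must add up to 100.
sfn.update_state_machine_alias(
    stateMachineAliasArn=alias_arn,
    routingConfiguration=[
        {"stateMachineVersionArn": old_version_arn, "weight": 95},
        {"stateMachineVersionArn": new_version_arn, "weight": 5},
    ],
)
```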
A Canary strategy deploys in two steps: first, a small increment of traffic routes to the new version, and if there are no problems during the testing period, the script switches 100% of traffic to the new version. In this script, use `--increment` to set the initial percentage of traffic to route to the new version. The `--interval` input specifies how long (in seconds) the Canary testing period lasts before switching 100% of traffic to the new version.
Here is an example showing a Canary deploy using the defaults for increment (5) and interval (120 seconds):
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy canary
This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine definition you just uploaded as the next version.
- Initially point 5% of traffic to this new version, using the `my-alias` alias. You can change the percentage of traffic with the `--increment` argument.
- Wait for the default period of 120s. You can change this value with the `--interval` input.
- Switch 100% of traffic to the new version (see the boto3 sketch after this list).
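Here is a minimal boto3 sketch of those steps for a default canary deploy (5% for 120 seconds). The ARNs are placeholders, alarm monitoring is omitted, and it assumes the alias currently routes 100% of traffic to a single previous version; the real script also handles rollback and error cases.

```python
import time
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
alias_arn = f"{state_machine_arn}:my-alias"

# 1. Upload the new definition as the latest revision of the state machine.
with open("my-dir/sample.asl.json") as f:
    update_response = sfn.update_state_machine(
        stateMachineArn=state_machine_arn, definition=f.read()
    )

# 2. Publish that revision as the next version.
new_version_arn = sfn.publish_state_machine_version(
    stateMachineArn=state_machine_arn,
    revisionId=update_response["revisionId"],
)["stateMachineVersionArn"]

# 3. Point 5% of traffic at the new version via the alias
#    (assumes the alias currently routes 100% to one previous version).
old_version_arn = sfn.describe_state_machine_alias(stateMachineAliasArn=alias_arn)[
    "routingConfiguration"
][0]["stateMachineVersionArn"]
sfn.update_state_machine_alias(
    stateMachineAliasArn=alias_arn,
    routingConfiguration=[
        {"stateMachineVersionArn": old_version_arn, "weight": 95},
        {"stateMachineVersionArn": new_version_arn, "weight": 5},
    ],
)

# 4. Wait out the canary testing period (alarm monitoring omitted here).
time.sleep(120)

# 5. Switch 100% of traffic to the new version.
sfn.update_state_machine_alias(
    stateMachineAliasArn=alias_arn,
    routingConfiguration=[{"stateMachineVersionArn": new_version_arn, "weight": 100}],
)
```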
Now let's switch 30% of traffic to the new version for a test period of 300 seconds. During those 300 seconds the script monitors two different alarms - if either of them triggers, the deployment rolls back. If the 300s complete with no alarms, the script switches 100% of traffic to the new version.
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --publish-revision --strategy canary --increment 30 --interval 300 --alarms MaxCPU "API Error Breach"
Note that in this script invocation the optional `--file` argument isn't specified, so the `--publish-revision` flag will publish the latest revision of the state machine as the new version without uploading a new definition.
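The alarm monitoring during that canary window boils down to a CloudWatch query. A minimal sketch, reusing the alarm names from the example above (the helper name is illustrative, not part of the script's interface):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def any_alarm_firing(alarm_names):
    """Return True if any of the named CloudWatch alarms is currently in the ALARM state."""
    response = cloudwatch.describe_alarms(AlarmNames=alarm_names, StateValue="ALARM")
    return len(response["MetricAlarms"]) > 0


if any_alarm_firing(["MaxCPU", "API Error Breach"]):
    print("Alarm triggered - roll back to the previous version")
```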
A Linear (or Rolling) deployment strategy gradually increases the percentage of traffic to the new state machine version from 0% to 100%, in regular increments.
For example, `--increment 20` with `--interval 600` will increase traffic by 20% every 600 seconds until the new version receives 100% of traffic.
If you set `--alarms`, the script will monitor the specified alarms during the deployment until 100% of traffic routes to the new version. If any of the alarms go into the `ALARM` state during the deployment window, the script will automatically and immediately roll back to the previous version. You can configure how often the script polls for alarms with `--alarm-polling`.
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy linear --increment 20 --interval=600 --alarms MaxCPU "API Error Breach" --history-max 11
This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine revision created in the previous step as the next version.
- Route 20% of traffic to the new version for 600s.
- Increase the percentage of traffic directed to the new version by 20% every 600 seconds.
- Monitor the two alarms every minute, and roll back automatically if an alarm triggers.
- After a successful deployment, delete historic versions prior to 11 versions ago.
The `--increment` does not need to be a factor of 100. The script will increment linearly until it reaches 100, and it caps the maximum weight at 100. If, for example, you set the increment to 15, the script will increment in seven steps - six steps of 15 to reach weight 90, and then a final step that adds only 10 to reach 100. There wouldn't be any further increments in this case.
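Here is a minimal sketch of that linear loop, using the same boto3 alias call as before. The function name and parameters are illustrative rather than the script's actual interface, and the alarm monitoring that the real script performs between steps is reduced to a plain sleep here.

```python
import time
import boto3


def linear_deploy(alias_arn, old_version_arn, new_version_arn, increment, interval):
    """Raise the new version's weight by `increment` every `interval` seconds, capped at 100."""
    sfn = boto3.client("stepfunctions")
    weight = 0
    while weight < 100:
        weight = min(weight + increment, 100)  # e.g. increment=15 -> 15, 30, ..., 90, 100
        if weight == 100:
            routing = [{"stateMachineVersionArn": new_version_arn, "weight": 100}]
        else:
            routing = [
                {"stateMachineVersionArn": old_version_arn, "weight": 100 - weight},
                {"stateMachineVersionArn": new_version_arn, "weight": weight},
            ]
        sfn.update_state_machine_alias(
            stateMachineAliasArn=alias_arn, routingConfiguration=routing
        )
        if weight < 100:
            time.sleep(interval)  # the real script also polls the --alarms during this wait
```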
An All at Once strategy routes 100% of traffic to the new version immediately, then monitors for problems during a configurable period. This is useful to support Blue/Green style deployments where you test the Green version first, then switch all your production traffic to that version. If any alarms trigger, the script will automatically roll back the alias to point to the Blue version.
You can set the monitoring period with `--interval` (in seconds). This deployment strategy ignores the `--increment` input.
./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy allatonce --interval=500 --alarms MaxCPU "API Error Breach" --history-max 10
This example will:
- Upload the `sample.asl.json` file as a new revision of the state machine.
- Publish the state machine definition you just uploaded as the next version.
- Point 100% of traffic to this new version, using the alias.
- Monitor the two alarms for 500s, and roll back automatically if an alarm triggers (see the sketch after this list).
- If no alarms trigger during this period, the deploy was a success.
- The script will then delete historic versions prior to ten versions ago.
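A minimal boto3 sketch of the all-at-once shift-and-monitor flow, using placeholder ARNs and the alarm names from the example above; the real script adds error handling and configurable polling.

```python
import time
import boto3

sfn = boto3.client("stepfunctions")
cloudwatch = boto3.client("cloudwatch")

# Placeholder ARNs - substitute your own.
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
alias_arn = f"{state_machine_arn}:my-alias"
old_version_arn = f"{state_machine_arn}:1"  # the known-good "Blue" version
new_version_arn = f"{state_machine_arn}:2"  # the new "Green" version

# Shift 100% of traffic to the new version immediately.
sfn.update_state_machine_alias(
    stateMachineAliasArn=alias_arn,
    routingConfiguration=[{"stateMachineVersionArn": new_version_arn, "weight": 100}],
)

# Monitor the alarms for the --interval period (500s in the example above)
# and roll back to the Blue version if any of them fire.
deadline = time.time() + 500
while time.time() < deadline:
    firing = cloudwatch.describe_alarms(
        AlarmNames=["MaxCPU", "API Error Breach"], StateValue="ALARM"
    )["MetricAlarms"]
    if firing:
        sfn.update_state_machine_alias(
            stateMachineAliasArn=alias_arn,
            routingConfiguration=[{"stateMachineVersionArn": old_version_arn, "weight": 100}],
        )
        break
    time.sleep(60)  # default alarm polling interval
```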
If you do not pass the optional `--file` argument, the `--publish-revision` flag will just publish the latest revision of the state machine as the new version, without first uploading a new definition from a local file.
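In boto3 terms that case is simply a publish call with no preceding definition upload - `publish_state_machine_version` acts on the state machine's current revision (the ARN below is a placeholder):

```python
import boto3

sfn = boto3.client("stepfunctions")

# Publishes the state machine's current (latest) revision as the next version.
version_arn = sfn.publish_state_machine_version(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
)["stateMachineVersionArn"]
print(version_arn)
```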
To get CLI input help, pass `--help`:
./sfndeploy.py --help
Here is a summary of the inputs:
❯ ./sfndeploy.py --help
usage: sfndeploy [-h] --state-machine STATE_MACHINE --alias ALIAS --region REGION
[--strategy {allatonce,canary,linear}] [--alarms [ALARMS ...]]
[--file SM_FILE] [--publish-revision | --no-publish-revision]
[--increment INCREMENT] [--interval INTERVAL]
[--alarm-polling ALARM_POLLING] [--history-max HISTORY_MAX]
[--force | --no-force]
Gradually deploy AWS Step Functions state machines.
options:
-h, --help show this help message and exit
--state-machine STATE_MACHINE
Name of the state machine (not ARN).
--alias ALIAS Name of alias.
--region REGION Region name. e.g 'us-east-1'
--strategy {allatonce,canary,linear}
The type of deployment to do. By default will deploy AllAtOnce.
--alarms [ALARMS ...]
Optional list of CloudWatch alarm names to monitor during
deployment.
--file SM_FILE Optional path to state machine definition file to deploy. Will
upload this file as the latest revision of the state machine. If
you don't set this, will use the current latest revision.
--publish-revision, --no-publish-revision
Publish the current revision to the next version.
--increment INCREMENT
The increment for weight increase during deploy strategy, from
0-100%. Just input the number, not the % sign.
--interval INTERVAL The interval in seconds at which to increase weight during the
deploy strategy.
--alarm-polling ALARM_POLLING
Poll alarms at this interval in seconds. Default 60s.
--history-max HISTORY_MAX
Maximum number of versions to keep in history. Will delete
versions older than this. Set to 0 to disable (this is the
default). There is a 1000 version limit in Step Functions.
--force, --no-force Force the deploy to start, even if the alias is not currently
pointing 100% at the old version. This may be required to recover
from a previous deploy that failed and didn't roll back correctly.
This means you might be overwriting an in-progress deploy, or that
something went wrong in a previous deploy. Be careful when
combining with publish_revision - if you just rerun the script you
might force publish a previously uploaded revision without
testing.
Step Functions limits the number of versions per state machine to 1000. As you release new versions of a state machine, the older versions remain associated with it. This can be useful because you might need to roll back to a previous version.
To avoid the build-up of historic versions reaching the limit of 1000, you need to trim your version history by deleting older versions once you are sure that you no longer need them.
This script provides an automatic version history deletion mechanism that runs after a deploy completes. You enable this with the `--history-max` argument. The script will delete any versions prior to `n` versions ago, where `n` is the number you pass to `--history-max`.
For example, if you pass `--history-max 5`, the script will keep only the five most recent versions and delete any versions prior to that.
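A minimal boto3 sketch of that trimming step, assuming a placeholder state machine ARN and `--history-max 5`. `list_state_machine_versions` returns versions newest-first, so everything after the first five gets deleted; note that deleting a version an alias still routes to is rejected by the service, which the real script has to account for.

```python
import boto3

sfn = boto3.client("stepfunctions")
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
history_max = 5

# Collect all versions; results come back newest-first.
versions = []
next_token = None
while True:
    kwargs = {"stateMachineArn": state_machine_arn, "maxResults": 100}
    if next_token:
        kwargs["nextToken"] = next_token
    page = sfn.list_state_machine_versions(**kwargs)
    versions.extend(page["stateMachineVersions"])
    next_token = page.get("nextToken")
    if not next_token:
        break

# Keep the newest `history_max` versions and delete the rest.
for version in versions[history_max:]:
    sfn.delete_state_machine_version(
        stateMachineVersionArn=version["stateMachineVersionArn"]
    )
```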
Carefully consider when a previous version is ready for deletion - you might need to roll back to it or refer to it for auditing purposes. Once you delete a state machine version, it is gone forever.
By default, the script polls for alarms every 60 seconds. This is because many AWS services only have an alarm granularity of 60s.
You can set the polling frequency with the `--alarm-polling` argument. For example, set `--alarm-polling 23` and the script will poll all the `--alarms` every 23 seconds.
The alarm polling interval is completely independent from `--interval`, so it does NOT need to divide it evenly. Take care to align how often you poll for alarms with the deployment window that you set with `--interval`. If `--alarm-polling` is high relative to `--interval`, the deployment window could finish before the script polls the alarms.
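A minimal sketch of a polling loop that respects both settings, using illustrative values of `--alarm-polling 23` and `--interval 300`; the real script layers rollback handling on top of this.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

alarm_names = ["MaxCPU", "API Error Breach"]
alarm_polling = 23  # --alarm-polling
interval = 300      # --interval (the deployment window for this step)

deadline = time.time() + interval
rollback_needed = False
while time.time() < deadline:
    in_alarm = cloudwatch.describe_alarms(
        AlarmNames=alarm_names, StateValue="ALARM"
    )["MetricAlarms"]
    if in_alarm:
        rollback_needed = True
        break
    # Sleep for the polling interval, but never past the end of the window.
    time.sleep(min(alarm_polling, max(deadline - time.time(), 0)))
```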
The script takes the following actions:
- If `--file` is specified, upload that file as the new revision of the state machine.
- If `--publish-revision` is set, publish the latest revision of the state machine as the next version. This becomes the new version to deploy. If you combine this with the `--file` input, this will publish the file you just uploaded as the next version. Note that if the script fails later on, it will NOT undeploy any revision uploaded or promoted to a version in these first two steps. See CloudFormation for provisioning with full rollback.
- If `--publish-revision` is not set, the most recent published version of the state machine will deploy. This is useful if you have some other process or tool that updates your state machine definitions, and you just want to use this script to switch the alias from the old version to that new version.
- Create the specified alias if it does not exist (see the sketch after this list). If the alias didn't exist, route 100% of traffic to the new version and exit the script, because a first deploy has no previous version to roll back to.
- Start routing traffic to the alias using the deployment strategy set by `--strategy` (AllAtOnce, Linear, Canary). The `--increment` and `--interval` arguments govern how the selected strategy behaves.
- Monitor any `--alarms` specified during the entire deployment period and roll back automatically if any of them go into the ALARM state.
- If the deployment completes successfully, keep the number of versions set by `--history-max` and delete state machine versions prior to that. The default value of 0 for `--history-max` disables this deletion of old versions, but remember there is a limit of 1000 versions per state machine.
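For the create-the-alias-if-missing step, here is a minimal boto3 sketch. The names and ARNs are placeholders, and the not-found handling is an assumption about how a missing alias surfaces from the describe call; the real script also records the previous version for rollback.

```python
import boto3
from botocore.exceptions import ClientError

sfn = boto3.client("stepfunctions")
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
alias_name = "my-alias"
alias_arn = f"{state_machine_arn}:{alias_name}"
new_version_arn = f"{state_machine_arn}:2"  # placeholder version ARN

try:
    sfn.describe_state_machine_alias(stateMachineAliasArn=alias_arn)
except ClientError as error:
    if error.response["Error"]["Code"] != "ResourceNotFound":
        raise
    # First deploy for this alias: point 100% of traffic at the new version
    # and stop, because there is no previous version to roll back to.
    sfn.create_state_machine_alias(
        name=alias_name,
        routingConfiguration=[
            {"stateMachineVersionArn": new_version_arn, "weight": 100}
        ],
    )
```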
You can run the unit tests for `sfndeploy.py` like this:
python -m unittest sfndeploy_test.py
This bash script shows how to use AWS CLI commands to do a gradual deployment.
To run sfn-canary-deploy.sh, you will need the AWS CLI installed and configured.
sfn-canary-deploy.sh is a bash script showing how to use the AWS CLI to create and manage a Canary-style deployment. For AllAtOnce or Linear deployments, see the Python version above.
The script does the following:
- Publish the most recent revision as the next version of the state machine if `publish_revision` is true. This becomes the new live version.
- If `publish_revision` is false, the most recent published version of the state machine will deploy.
- Create the alias if it doesn't exist yet. If the alias didn't exist, point 100% of traffic for this alias to the new version, then exit the script.
- Update the routing configuration for the alias to direct a small percentage of traffic from the previous version to the new version. You set this canary percentage with `canary_percentage`.
- Monitor the configured CloudWatch alarms every 60s by default. If any of these alarms trigger, roll back the deployment immediately by pointing 100% of traffic to the known-good previous version. The script keeps monitoring the alarms every `alarm_polling_interval` seconds until `canary_interval_seconds` have passed.
- If no alarms triggered during the canary interval, shift 100% of traffic to the new version. You set this interval with `canary_interval_seconds`.
- Upon successful deployment, delete any versions older than `history_max`.
Here are some tips to get you started with popular CD platforms:
You can run your customized Bash or Python script on Jenkins by using the `sh` step in the `Jenkinsfile`.
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                echo 'Building..'
            }
        }
        stage('Test') {
            steps {
                echo 'Testing..'
            }
        }
        stage('Gradual Deploy') {
            steps {
                sh '/path/to/gradual-deploy-script.sh'
            }
        }
    }
}
You have some options for configuring the prerequisites:
- If you want to run your script directly from the Jenkins pipeline, you must install and configure your prerequisites on the Jenkins server instance - in this case the AWS CLI for the Bash script or Boto3 for the Python script.
- The Jenkins user must have AWS credentials to access the Step Functions service.
- If you are using the standard Amazon Machine Image (AMI) as a base for your Jenkins installation, it already contains the prerequisites.
- Alternatively, if you want to use custom Docker images to encapsulate your dependencies and scripts, you can use the Docker Pipeline Plugin and let Jenkins run your scripts inside the container.
Use the Jenkins stage or the Script stage in Spinnaker to run a custom shell or Python script from your pipeline.
With the Script stage, Spinnaker uses Jenkins to sandbox your scripts, so you need to set up a Jenkins instance in order to use it.
In your Spinnaker deck, select:
- Add Stage.
- Select the Script type of stage.
- Under Command, enter your script invocation.
- Set Depends On if there's a preceding stage that should run before your custom script.
Alternatively, you can encapsulate your logic and its dependencies in a container and execute it with a Run Job stage.
apiVersion: batch/v1
kind: Job
metadata:
  name: gradual-deploy
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
        - command:
            - python
            - path/to/my/script.py
          image: 'myrepo/mycontainer:1.2.3'
          name: my-custom-script
      restartPolicy: Never
Remember that creating and running resources in AWS costs money. Take care to delete resources when you're done to avoid billing surprises.
All the scripts in this repo are examples that are not meant for production systems. The scripts here do not clean up or release resources when finished. Take care & run at your own risk.