Step Functions Gradual Deployments

These are reference scripts to demonstrate how to do gradual deployments using AWS Step Functions versions and aliases.

You can use these scripts as inspiration to provision your own gradual deployments in your CI/CD environments of choice.

The Python example shows how to use an AWS SDK to manage a gradual deployment, whereas the Bash script shows which AWS CLI commands you can use if you prefer. An alternative is to use CloudFormation for Step Functions Gradual Deployments.

Python API example

Prerequisites

Since this is a Python script, you need the Python 3 runtime.

To run sfndeploy.py, you will need to install boto3 and configure it with your credentials.

tl;dr: pip install boto3

Script Overview

sfndeploy.py is a Python 3 script showing how to use the boto3 AWS SDK for Python to create gradual deployments with Step Functions.

This script demonstrates the following deployment strategies:

  1. Canary - route a small percentage of traffic to the new version initially, then after a validation period where no alarms trigger, switch 100% to that new version.
  2. Linear (aka Rolling) - route a percentage of traffic, which increases over time from 0% to 100%, to the new version, rolling back immediately if any alarms trigger.
  3. All at Once (aka Blue/Green) - immediately switch 100% to the new version, monitor the new version and roll-back automatically to the previous version if any alarms trigger.

Canary

A Canary strategy deploys in two steps: first a small increment of traffic routes to the new version, and if there are no problems during the set testing period it will switch 100% of traffic to the new version.

In this script, use --increment to set the initial percentage of traffic to route to the new version. The --interval input specifies for how long (in seconds) the Canary testing period lasts before switching 100% of traffic to the new version.

Example with defaults

Here is an example showing a Canary deploy using the defaults for increment (5) and interval (120 seconds):

 ./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy canary

This example will:

  • Upload the sample.asl.json file as a new revision of the state machine.
  • Publish the state machine definition you just uploaded as the next version.
  • Initially point 5% of traffic to this new version, using the my-alias alias. You can change the percentage of traffic with the --increment argument.
  • Wait for the default period of 120s. You can change this value with the --interval input.
  • Switch 100% of traffic to the new version.
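The 5% split in the steps above can be expressed as an alias routing configuration. The following is a minimal sketch, not the code from sfndeploy.py: the function name canary_routing and the ARNs are hypothetical, and it assumes the two-entry list shape that boto3's update_state_machine_alias accepts in its routingConfiguration parameter, with weights that add up to 100.

```python
def canary_routing(old_version_arn, new_version_arn, increment):
    """Build a two-version routing configuration for the canary phase.

    `increment` is the percentage of traffic for the new version; the
    previous version keeps the remainder so that the weights total 100.
    """
    if not 0 < increment < 100:
        raise ValueError("increment must be between 1 and 99 for a two-version split")
    return [
        {"stateMachineVersionArn": old_version_arn, "weight": 100 - increment},
        {"stateMachineVersionArn": new_version_arn, "weight": increment},
    ]


# Hypothetical version ARNs, for illustration only.
OLD = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:3"
NEW = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine:4"

routing = canary_routing(OLD, NEW, 5)
```

With a boto3 Step Functions client, a list like this could be passed as routingConfiguration to update_state_machine_alias, together with the alias ARN as stateMachineAliasArn.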

Example with values for increment and interval

Now let's switch 30% of traffic to the new version for a test period of 300 seconds. During the 300s the script monitors two different alarms. If either alarm triggers, the deployment will roll back. If the 300s complete with no alarms, the script switches 100% of traffic to the new version.

./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --publish-revision --strategy canary --increment 30 --interval 300 --alarms MaxCPU "API Error Breach"

Note that this invocation does not specify the optional --file argument, so the --publish-revision flag will publish the latest revision of the state machine as the new version without uploading a new definition.

Linear

A Linear (or Rolling) deployment strategy gradually increases the percentage of traffic to the new state machine version from 0% to 100%, in regular increments.

For example, an --increment 20 with --interval 600 will increase traffic by 20% every 600 seconds until the new version receives 100% of traffic.

If you set --alarms, the script will monitor the alarms specified during the deployment until all traffic routes 100% to the new version. If any of the alarms go into the ALARM state during the deployment window, the script will automatically and immediately rollback to the previous version. You can configure how often the script polls for alarms with --alarm-polling.

./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy linear --increment 20 --interval=600 --alarms MaxCPU "API Error Breach" --history-max 11

This example will:

  • Upload the sample.asl.json file as a new revision of the state machine.
  • Publish the state machine revision created in the previous step as the next version.
  • Route 20% of traffic to the new version for 600s.
  • Increase the percentage of traffic directed to the new version by 20% every 600 seconds.
  • Monitor the two alarms every minute, and roll back automatically if an alarm sounds.
  • The script will then delete historic versions prior to 11 versions ago.

The increment does not need to be a factor of 100. The script will increment linearly until it reaches 100. The script caps the maximum weight at 100. If, for example, you set increment to 15, the script will increment in seven steps - six steps of 15 to reach weight 90, and then the final step would only add ten to reach 100. There wouldn't be any further increments in this case.
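The capping behavior described above can be sketched as a small function that lists the weight after each step. This is an illustration of the arithmetic only; the name linear_weights is hypothetical and the real script may compute its steps differently.

```python
def linear_weights(increment):
    """Return the new-version weight after each linear step, capped at 100."""
    if not 0 < increment <= 100:
        raise ValueError("increment must be between 1 and 100")
    weights = []
    weight = 0
    while weight < 100:
        # The final step only adds whatever is left to reach 100.
        weight = min(weight + increment, 100)
        weights.append(weight)
    return weights


steps = linear_weights(15)
print(steps)  # [15, 30, 45, 60, 75, 90, 100]
```

For an increment of 15 this gives seven steps, the last of which adds only ten.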

All at Once

An All at Once strategy routes 100% of traffic to the new version immediately, then monitors for problems during a configurable period. This is useful to support Blue/Green style deployment where you test the Green version first, then switch all your production traffic to that version. If any alarms trigger, the script will automatically roll back the alias to point to the Blue version.

You can set the monitoring period with --interval (in seconds).

This deployment strategy ignores the --increment input.

./sfndeploy.py --state-machine my-state-machine --region us-east-1 --alias=my-alias --file my-dir/sample.asl.json --publish-revision --strategy allatonce --interval=500 --alarms MaxCPU "API Error Breach" --history-max 10

This example will:

  • Upload the sample.asl.json file as a new revision of the state machine.
  • Publish the state machine you just uploaded as the next version.
  • Point 100% of traffic to this new version, using the alias.
  • Monitor the two alarms for 500s, and rollback automatically if an alarm sounds.
  • If there are no alarms during this period, the deploy was a success.
  • The script will then delete historic versions prior to ten versions ago.

If you do not pass the optional --file argument, the --publish-revision flag will just publish the latest revision of the state machine to the new version without first uploading a new definition from a local file.

CLI inputs

To get CLI input help, pass --help:

./sfndeploy.py --help

Here is a summary of the inputs:

❯ ./sfndeploy.py --help
usage: sfndeploy [-h] --state-machine STATE_MACHINE --alias ALIAS --region REGION
                 [--strategy {allatonce,canary,linear}] [--alarms [ALARMS ...]]
                 [--file SM_FILE] [--publish-revision | --no-publish-revision]
                 [--increment INCREMENT] [--interval INTERVAL]
                 [--alarm-polling ALARM_POLLING] [--history-max HISTORY_MAX]
                 [--force | --no-force]

Gradually deploy AWS Step Functions state machines.

options:
  -h, --help            show this help message and exit
  --state-machine STATE_MACHINE
                        Name of the state machine (not ARN).
  --alias ALIAS         Name of alias.
  --region REGION       Region name. e.g 'us-east-1'
  --strategy {allatonce,canary,linear}
                        The type of deployment to do. By default will deploy AllAtOnce.
  --alarms [ALARMS ...]
                        Optional list of CloudWatch alarm names to monitor during
                        deployment.
  --file SM_FILE        Optional path to state machine definition file to deploy. Will
                        upload this file as the latest revision of the state machine. If
                        you don't set this, will use the current latest revision.
  --publish-revision, --no-publish-revision
                        Publish the current revision to the next version.
  --increment INCREMENT
                        The increment for weight increase during deploy strategy, from
                        0-100%. Just input the number, not the % sign.
  --interval INTERVAL   The interval in seconds at which to increase weight during the
                        deploy strategy.
  --alarm-polling ALARM_POLLING
                        Poll alarms at this interval in seconds. Default 60s.
  --history-max HISTORY_MAX
                        Maximum number of versions to keep in history. Will delete
                        versions older than this. Set to 0 to disable (this is the
                        default). There is a 1000 version limit in Step Functions.
  --force, --no-force   Force the deploy to start, even if the alias is not currently
                        pointing 100% at the old version. This may be required to recover
                        from a previous deploy that failed and didn't roll back correctly.
                        This means you might be overwriting an in-progress deploy, or that
                        something went wrong in a previous deploy. Be careful when
                        combining with publish_revision - if you just rerun the script you
                        might force publish a previously uploaded revision without
                        testing.
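The usage text above has the shape that Python's argparse produces. As a rough sketch of how inputs like these could be declared (this is not the script's actual source): the integer types, the False defaults for the two boolean flags, and the allatonce default spelling are assumptions, and argparse.BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Declare a parser that mirrors the inputs listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="sfndeploy",
        description="Gradually deploy AWS Step Functions state machines.",
    )
    parser.add_argument("--state-machine", required=True)
    parser.add_argument("--alias", required=True)
    parser.add_argument("--region", required=True)
    parser.add_argument(
        "--strategy", choices=["allatonce", "canary", "linear"], default="allatonce"
    )
    # nargs="*" collects every value up to the next option into a list.
    parser.add_argument("--alarms", nargs="*", default=[])
    parser.add_argument("--file", dest="sm_file")
    # BooleanOptionalAction generates the paired --no-... flag automatically.
    parser.add_argument(
        "--publish-revision", action=argparse.BooleanOptionalAction, default=False
    )
    parser.add_argument("--increment", type=int, default=5)
    parser.add_argument("--interval", type=int, default=120)
    parser.add_argument("--alarm-polling", type=int, default=60)
    parser.add_argument("--history-max", type=int, default=0)
    parser.add_argument("--force", action=argparse.BooleanOptionalAction, default=False)
    return parser


args = build_parser().parse_args(
    [
        "--state-machine", "my-state-machine",
        "--region", "us-east-1",
        "--alias=my-alias",
        "--publish-revision",
        "--strategy", "linear",
        "--alarms", "MaxCPU", "API Error Breach",
        "--history-max", "11",
    ]
)
```

Note how the alarm name containing spaces arrives as a single list element, which is why the shell examples quote "API Error Breach".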

Version History Deletion

Step Functions limits the number of versions per state machine to 1000. As you release new versions of a state machine, the older versions remain in the state machine. This can be useful because you might need to rollback to a previous version.

To prevent the build-up of historic versions from reaching the limit of 1000, you need to trim your version history by deleting older versions once you are sure that you no longer need them.

This script provides an automatic version history deletion mechanism that runs after a deploy completes. You enable this with the --history-max argument. The script will delete any versions prior to n versions ago, where n is the number you pass to --history-max.

For example, if you pass --history-max 5, the script will only keep five versions and delete any versions prior to that.

Carefully consider when a previous version is ready for deletion - you might need to rollback to it or to refer to it for auditing purposes. Once you delete a state machine version, it is gone forever.
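The selection of versions to trim can be sketched as a pure function. This is an assumption-laden illustration rather than the script's implementation: versions_to_delete is a hypothetical name, the ARNs are made up, and the input list is assumed to be ordered newest first.

```python
def versions_to_delete(version_arns_newest_first, history_max):
    """Return the version ARNs that fall outside the newest `history_max`.

    A history_max of 0 disables deletion, matching the documented default.
    """
    if history_max <= 0:
        return []
    return list(version_arns_newest_first[history_max:])


# Hypothetical ARNs for versions 8 down to 1, newest first.
PREFIX = "arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine"
all_versions = [f"{PREFIX}:{number}" for number in range(8, 0, -1)]

doomed = versions_to_delete(all_versions, 5)
```

Here doomed holds versions 3, 2 and 1, leaving the five newest in place. With boto3, the ARNs could come from list_state_machine_versions and each deletion would be a delete_state_machine_version call; check the ordering of the listing before relying on it, since a deleted version cannot be recovered.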

Alarm Polling Frequency

By default, the script polls for alarms every 60 seconds. This is because many AWS services only have an alarm granularity of 60s.

You can set the polling frequency with the --alarm-polling argument. For example, set --alarm-polling 23 and the script will poll all the --alarms every 23 seconds.

The alarm polling interval is independent of --interval, so --interval does NOT need to be evenly divisible by --alarm-polling.

Take care to align how often you poll for alarms with the deployment window that you set with --interval. If --alarm-polling is high relative to --interval the deployment window could finish before the script polls the alarms.
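To see how the two values interact, the sketch below lists the offsets (in seconds from the start of a deployment window) at which polls would happen. It assumes the first poll occurs one full polling period after the window opens; that timing, and the name poll_offsets, are assumptions for illustration and may not match the script exactly.

```python
def poll_offsets(interval, alarm_polling):
    """Return the seconds into the window at which each alarm poll happens."""
    if interval <= 0 or alarm_polling <= 0:
        raise ValueError("interval and alarm_polling must be positive")
    return list(range(alarm_polling, interval + 1, alarm_polling))


default_polls = poll_offsets(120, 60)  # polls at 60s and 120s
uneven_polls = poll_offsets(120, 23)   # polling does not divide the interval
missed_polls = poll_offsets(50, 60)    # window ends before the first poll
```

Under this assumption the third case produces no polls at all, which is the situation the paragraph above warns about.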

Script Process Flow

The script takes the following actions:

  1. If --file specified, upload that file as the new revision for the state machine
  2. If --publish-revision set, publish the latest revision of the state machine as the next version. This will become the new version to deploy. If you combine this with the --file input, this will publish the file you just uploaded as the next version.
  3. Note that if the script fails later on, it will NOT undeploy any revision uploaded or promoted to a version in the first two steps. See CloudFormation for provisioning with full rollback.
  4. If --publish-revision is not set, the most recent published version of the state machine will deploy. This is useful if you have some other process or tool that updates your state machine definitions, and you just want to use this script as a way to switch the alias from the old version to that new version.
  5. Create the specified alias if it does not exist. If the alias didn't exist, route 100% of traffic to the new version and exit the script. This is because it would be a first deploy, so there is no rollback possible.
  6. Start routing traffic to the alias using the deployment strategy set by --strategy. (AllAtOnce, Linear, Canary). The --increment and --interval arguments govern how the strategy you select behaves.
  7. Monitor any --alarms specified during the entire deployment period and rollback automatically if any of these go into ALARM state.
  8. If deployment completes successfully, keep the number of versions set by history_max and delete state machine versions prior to that. The default value of 0 for history_max disables this deletion of old versions, but remember there is a limit of 1000 versions per state machine.
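The alarm check and rollback in step 7 can be sketched with two small helpers. This is a minimal sketch rather than the script's own code: the function names are hypothetical, the clients are passed in so that the helpers can be exercised without AWS access, and pagination of describe_alarms is ignored on the assumption that only a handful of alarm names are given.

```python
def alarms_firing(cloudwatch, alarm_names):
    """Return the names of the given alarms that are in the ALARM state."""
    if not alarm_names:
        return []
    response = cloudwatch.describe_alarms(
        AlarmNames=list(alarm_names),
        # Ask for both kinds so that composite alarms are not silently skipped.
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    alarms = response.get("MetricAlarms", []) + response.get("CompositeAlarms", [])
    return [alarm["AlarmName"] for alarm in alarms if alarm["StateValue"] == "ALARM"]


def roll_back(sfn, alias_arn, old_version_arn):
    """Point the alias back at the previous version with 100% of traffic."""
    sfn.update_state_machine_alias(
        stateMachineAliasArn=alias_arn,
        routingConfiguration=[
            {"stateMachineVersionArn": old_version_arn, "weight": 100}
        ],
    )
```

In real use the clients would be created with boto3.client("cloudwatch", region_name=...) and boto3.client("stepfunctions", region_name=...), and roll_back would be called as soon as alarms_firing returns a non-empty list.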

Unit tests

You can run the unit tests for sfndeploy.py like this:

python -m unittest sfndeploy_test.py

AWS CLI example

This bash script shows how to use AWS CLI commands to do a gradual deployment.

Prerequisites

To run sfn-canary-deploy.sh, you will need the AWS CLI installed and configured.

CLI Script Overview

sfn-canary-deploy.sh is a bash script showing how to use the AWS CLI to create and manage a Canary-style deployment. For AllAtOnce or Linear deployments, see the Python version above.

The script does the following:

  1. Publish the most recent revision as the next version of the state machine if publish_revision is true. This will become the new live version.
  2. If publish_revision is false, the most recent published version of the state machine will deploy.
  3. Create the alias if it doesn’t exist yet. If the alias didn't exist, point 100% of traffic for this alias to the new version, then exit the script.
  4. Update the routing configuration for the alias to direct a small percentage of traffic from the previous to the new version. You set this canary percentage with canary_percentage.
  5. Monitor the configurable CloudWatch alarms, every 60s by default. If any of these alarms trigger, roll back the deployment immediately by pointing 100% of traffic to the known-good previous version. The script keeps polling the alarms every alarm_polling_interval seconds until canary_interval_seconds have passed.
  6. If there were no alarms during the canary interval, shift 100% of traffic to the new version. You set this interval with canary_interval_seconds.
  7. Upon successful deployment, delete any versions older than history_max.

Gradual Deployments from CI/CD tools

Here are some tips to get you started with popular CD platforms:

Jenkins

You can run your customized Bash or Python on Jenkins by using the sh step in the Jenkinsfile to run your custom script.

pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                echo 'Building..'
            }
        }
        stage('Test') {
            steps {
                echo 'Testing..'
            }
        }
        stage('Gradual Deploy') {
            steps {
                sh '/path/to/gradual-deploy-script.sh'
            }
        }
    }
}

You have some options to configure the prerequisites:

  • If you want to run your script directly from the Jenkins pipeline, you must install and configure your prerequisites on the Jenkins server instance - in this case the AWS CLI for the Bash script or Boto3 for the Python script.
  • The Jenkins user must have AWS credentials to access the Step Functions service.
  • If you are using the standard Amazon Machine Image (AMI) as a base for your Jenkins installation, this already contains the prerequisites.
  • Alternatively, if you want to use custom Docker images to encapsulate your dependencies and scripts, you can use the Docker Pipeline Plugin and let Jenkins run your scripts inside the container.

Spinnaker

Use the Jenkins stage or the Script stage in Spinnaker to run a custom shell or Python script from your pipeline.

With the Script stage, Spinnaker uses Jenkins to sandbox your scripts, so you need to set up a Jenkins instance in order to use it.

In your Spinnaker deck, select:

  • Add Stage.
  • Select the Script type of stage.
  • Under Command enter your script invocation.
  • Set Depends On if there’s a preceding stage that should run before your custom script.

Alternatively, you can encapsulate your logic and its dependencies in a container and execute it with a Run Job stage.

apiVersion: batch/v1
kind: Job
metadata:
  name: gradual-deploy
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
        - command:
            - python
            - path/to/my/script.py
          image: 'myrepo/mycontainer:1.2.3'
          name: my-custom-script
      restartPolicy: Never

Warning

Remember that creating and running resources in AWS costs money. Take care to delete resources when you're done to avoid billing surprises.

All the scripts in this repo are examples that are not meant for production systems. The scripts here do not clean up or release resources when finished. Take care & run at your own risk.