Many teams want to back up large datasets directly to S3 Glacier or Glacier Deep Archive for a low-cost DR strategy. The first instinct is often to reach for `aws s3 sync` or `rclone` to do this. While that approach can work, it's important to understand the drawbacks, specifically the cost of PUT operations and metadata overhead. PUT operations (writes) to Glacier are more expensive than to S3 Standard, and each object written to Glacier carries ~32 KB of overhead billed at Glacier rates plus 8 KB of metadata stored in S3 Standard. This means that if you have many small files, it can actually cost more to write them to Glacier than to keep them in S3.
To give an overly simplistic and dramatic scenario, let's say you have 10,000,000 1 KB files, roughly 10 GB in total. Writing these to Glacier as-is would cost:

- PUT requests: 10,000,000 PUTs × ($0.05 per 1,000 PUTs) = $500
- S3 metadata: 10,000,000 × 8 KB ≈ 80 GB × $0.023 per GB-month (S3 Standard) ≈ $1.84 per month
- Glacier storage: 10,000,000 × (1 KB + 32 KB) ≈ 330 GB × $0.004 per GB-month (non-Deep Archive Glacier) ≈ $1.32 per month

Even though the total payload was only ~10 GB, the initial upload costs $500 plus roughly $3.16 per month in storage. For comparison, uploading the same files to S3 Standard would cost about a tenth of that in PUT requests (~$50), with a monthly storage cost of roughly $0.23. The problem is amplified if there is a high change rate with more frequent backups/syncs.
To reap the benefits of using S3 Glacier as a low-cost backup target, backups need to be sent in a way that minimizes the number of PUT operations and avoids small objects. Many commercial backup/archival software products address this, but what if you're looking for an open and free alternative? This is where Duplicity comes in.

Duplicity is an open source command line backup utility that uses `librsync` to perform full and incremental backups of files and write them to a variety of targets, including S3 and S3 Glacier. It optimizes backups by combining files into tar archives with (optional) compression and encryption. It also creates compact incrementals by using file diffs rather than copying the entire contents of changed files. Using Duplicity, we can periodically create full backups, send them to S3 Glacier as consolidated tar volumes, and then capture incremental backups of the changes in between.
Revisiting the 10,000,000 1 KB file example above: with Duplicity, a full backup with a conservative 20% compression ratio and a volume size of 1 GB could result in as few as eight 1 GB objects written to Glacier, costing just $0.0004 in PUT operations and about $0.032 per month.
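For a quick sanity check of these numbers, the back-of-the-envelope math above can be reproduced with a short shell snippet (the per-GB-month and per-request prices are the ones used above and vary by region):

# Back-of-the-envelope check of the two scenarios above.
# Assumed prices: $0.05 per 1,000 Glacier PUTs, $0.023/GB-mo S3 Standard,
# $0.004/GB-mo Glacier; 1 GB treated as 1,000,000 KB for simplicity.
awk 'BEGIN {
  files = 10000000
  put     = files / 1000 * 0.05                  # Glacier PUT requests
  s3_meta = files * 8 / 1000000 * 0.023          # 8 KB of S3 metadata per object
  glacier = files * (1 + 32) / 1000000 * 0.004   # payload + 32 KB Glacier overhead
  printf "raw small files: $%.2f upfront, $%.2f/month\n", put, s3_meta + glacier
  vols = 8                                       # ~10 GB at 20% compression, 1 GB volumes
  printf "duplicity volumes: $%.4f upfront, $%.3f/month\n", vols * 0.05 / 1000, vols * 0.004
}'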
The guide below describes how to set up and use a containerized version of Duplicity. It uses a Docker image sourced from the Tecnativa repository on GitHub. This repo contains a Dockerfile that builds a Duplicity container along with helper utilities to simplify common tasks and scenarios. Using either the commands or the docker-compose file below, you can quickly set up a Duplicity backup strategy to efficiently back up files directly to Glacier or Glacier Deep Archive on a regular basis.
The container provided in the Tecnativa repo can be used in two ways:
- Call the container to run individual Duplicity commands and immediately exit
- Run the container continuously and use the built-in cron/scheduler functionality to run jobs
If you want to run Duplicity operations on demand, you can do so with `docker run`. This option is most useful if you do not want to run the container continuously.
First, you'll need to clone the repo and build the Docker image:
git clone https://github.com/Tecnativa/docker-duplicity.git
docker build --target latest -t docker-duplicity:latest ./docker-duplicity
Note: The repo is available on Docker Hub as well, but it is a good practice to clone the repo and build/push the image into your own container image repo.
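If you do push the image to your own registry, the tag-and-push step might look like the sketch below; the registry URL is a placeholder (for Amazon ECR you would also log in first, for example with `aws ecr get-login-password` piped to `docker login`):

# Placeholder registry/repository; substitute your own private registry.
REGISTRY=123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag docker-duplicity:latest "$REGISTRY/docker-duplicity:latest"
docker push "$REGISTRY/docker-duplicity:latest"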
With the image built and tagged, we can run individual commands. Below are examples.
**Note:** The commands below assume S3 credentials are available to the container. If not, they can be passed as the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. Always store these credentials in a secure location and make sure the IAM principal they are tied to has least-privilege access!
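One way to do that without hard-coding secrets (a sketch with placeholder values, not part of the original commands) is to export the credentials in your shell, or load them from a secrets manager, and forward them into the container by name:

# Placeholder values; use a dedicated, least-privilege IAM principal for backups.
export AWS_ACCESS_KEY_ID='AKIAEXAMPLE'
export AWS_SECRET_ACCESS_KEY='example-secret'

# Then add these flags to the docker run commands below; Docker copies the
# values from the host environment when no value is given after the name:
#   -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY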
docker run -it --rm --hostname 'backupname' \
-e OPTIONS="--full-if-older-than 4W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \
-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \
-v /tmp/testfiles:/mnt/backup/src/data:ro \
-v "$PWD/config:/mnt/backup/src/config" \
-v "$PWD/logs:/logfiles" \
docker-duplicity:latest dup \
/mnt/backup/src/data \
boto3+s3://bucketname/backup-folder
Breaking this down:
| Command line | Description |
| --- | --- |
| `docker run -it --rm --hostname 'backupname' \` | Run the container using hostname `backupname`. This should be a unique name that describes the backup; it is used in the name of the manifest file and in subsequent commands to look up the backup metadata. |
| `-e OPTIONS="--full-if-older-than 4W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \` | The Duplicity configuration options that will be used. In this example they tell Duplicity to create a new full backup if the latest full is older than 4 weeks, write volumes to the Glacier Deep Archive storage class, skip compression and encryption, show progress, keep its local metadata cache in `/mnt/backup/src/config`, upload with multiple processes, use 1024 MB volumes, and log at "info" verbosity to a timestamped log file. These can be adjusted as needed. |
| `-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \` | Additional Duplicity options applied to the command. In general, these don't need to be changed and are separated for convenience. |
| `-v /tmp/testfiles:/mnt/backup/src/data:ro \` | Mounts `/tmp/testfiles` from the host and uses it as the backup source. Change this to the path of your backup source. |
| `-v "$PWD/config:/mnt/backup/src/config" \` | Mounts `./config` from the host as a volume in the container. Change this to a location where you can store Duplicity backup information persistently. The Duplicity manifest and signature files are stored here to track backup sets/chains. Note: although these files are also backed up to S3/Glacier, they should be kept and used locally; otherwise, you will need to restore the .sig file from Glacier before you can perform incrementals. |
| `-v "$PWD/logs:/logfiles" \` | Mounts `./logs` so that Duplicity can write its log files there and persist them. Change to a persistent location. |
| `docker-duplicity:latest dup \` | The container image and the command to execute (`dup` is a helper script used to run duplicity). |
| `/mnt/backup/src/data \` | The backup source, in this case the mount point of the source volume. |
| `boto3+s3://bucketname/backup-folder` | The target to write to, in this case an S3 bucket. Note: the `--s3-use-deep-archive` option above causes the backups to be written to the Glacier Deep Archive tier in S3. |
The command below will clean up backup sets (fulls and their incrementals) that are older than the 7 most recent full backups. In other words, if full backups are taken once every 4 weeks, fulls and incrementals from roughly the last 7 × 28 = 196 days (about 6.5 months) are retained. Note: without `--force`, Duplicity only lists the chains that would be deleted; add `--force` to actually remove them.
docker run -it --rm --hostname 'backupname' \
-e OPTIONS="--full-if-older-than 4W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \
-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \
-v /tmp/testfiles:/mnt/backup/src/data:ro \
-v "$PWD/config:/mnt/backup/src/config" \
-v "$PWD/logs:/logfiles" \
docker-duplicity:latest dup \
remove-all-but-n-full 7 \
boto3+s3://bucketname/backup-folder
List all backup chains (fulls and incrementals):
docker run -it --rm --hostname 'backupname' \
-e OPTIONS="--full-if-older-than 47W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \
-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \
-v /tmp/testfiles:/mnt/backup/src/data:ro \
-v "$PWD/config:/mnt/backup/src/config" \
-v "$PWD/logs:/logfiles" \
docker-duplicity:latest dup \
collection-status \
boto3+s3://bucketname/backup-folder
List all files in the latest backup. To list files from a specific point in time, use the `--time TIME` option. See the TIME FORMATS section of the Duplicity man page for more information on how to specify `TIME`.
docker run -it --rm --hostname 'backupname' \
-e OPTIONS="--full-if-older-than 47W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \
-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \
-v /tmp/testfiles:/mnt/backup/src/data:ro \
-v "$PWD/config:/mnt/backup/src/config" \
-v "$PWD/logs:/logfiles" \
docker-duplicity:latest dup \
list-current-files \
boto3+s3://bucketname/backup-folder
The container from the Tecnativa repo also has a built-in scheduler option: the container runs continuously and executes Duplicity commands on a schedule. The arguments are similar.

The following is an example of a `docker-compose` file that builds the container image, launches the container, and schedules a daily backup of a source to S3. Like the `docker run` commands above, it is configured to create a full backup every 4 weeks and incrementals otherwise. It also cleans up backup sets (fulls and associated incrementals) that are older than the last 7 fulls.
version: "3.7"
services:
backup:
build:
context: https://github.com/Tecnativa/docker-duplicity.git
target: latest
image: docker-duplicity:latest
hostname: srcname.backup
container_name: srcname
privileged: true # To speak with host's docker socket
#context: ./
volumes:
- ./config:/mnt/backup/src/config
- /tmp/testfiles://mnt/backup/src/data:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./logs:/logfiles
environment:
# JOB_200: Run backup daily. Based on $OPTIONS below, this will create a full every 4 weeks and incrementals otherwise
JOB_200_WHAT: dup $$SRC $$DST
JOB_200_WHEN: daily
# JOB_300: Cleanup backups. This will delete backup files that are older than the last 6 fulls. 4 weeks per full x 7 fulls = 180 week retention
# NOTE: Without --force this only lists chains to be deleted
JOB_300_WHAT: dup remove-all-but-n-full 7 $$DST --force
JOB_300_WHEN: daily
# SPECIFY an AWS ACCESS_KEY to use. DO NOT store the key and ID docker-compose file itself to avoid accidentally committing credentials to a public repo! It will be compromised in minutes! Use environment vars or secrets instead
# If you run this container on ECS, you can automatically use the task execution role's permissions instead of specifying them here.
#AWS_ACCESS_KEY_ID: example amazon s3 access key
#AWS_SECRET_ACCESS_KEY: example amazon s3 secret key
DST: boto3+s3://bucket/backup-path
SMTP_HOST: localhost
EMAIL_FROM: [email protected]
EMAIL_TO: [email protected]
# Duplicity options: http://duplicity.nongnu.org/vers8/duplicity.1.html#sect5
OPTIONS: --s3-use-deep-archive --no-encryption --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$$(date "+%Y%m%d-%H%M%S").log --full-if-older-than 4W --asynchronous-upload
# Generally, these shouldn't change
OPTIONS_EXTRA: --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$$(hostname -f)- --file-prefix-manifest manifest-$$(hostname -f)- --file-prefix-signature signature-$$(hostname -f)-
Note: This `docker-compose` file will pull and build the container image directly from the Tecnativa repo. It is suggested that you clone the repo, review the code, and then point `docker-compose` at your own copy to build the image.
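With the compose file saved (for example as `docker-compose.yml`) and credentials supplied via the environment or an `.env` file, bringing up the scheduled backup container might look like this:

# Build the image and start the scheduler container in the background.
docker-compose up -d --build

# Follow the scheduler and Duplicity output for the backup service.
docker-compose logs -f backup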
If backing up to Glacier or Glacier Deep Archive, restores are a two-step process:
- First, you will need to restore the Duplicity backup archives from Glacier into S3 (see the sketch after this list).
- Once the files have been restored to S3, you can run the Duplicity command below to recover files. It recovers a specific path, `/smallfiles`, from the backup, and the recovered files are written to the local path `/mnt/recovered`.
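For the first step, one option (a sketch, not part of the original guide) is to issue S3 restore requests with the AWS CLI for every object under the backup prefix, assuming the `bucketname`/`backup-folder` names from the examples above. Adjust the retention days and retrieval tier to your needs, and note that Deep Archive Bulk retrievals can take up to 48 hours:

# Request temporary restores from Glacier/Deep Archive for all backup objects.
# restore-object returns an error for objects that are not archived; those can be ignored.
BUCKET=bucketname
PREFIX=backup-folder/
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
  aws s3api restore-object --bucket "$BUCKET" --key "$key" \
      --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'
done

Once the restore jobs have completed (`aws s3api head-object` reports the restore status for an individual object), the Duplicity restore below can read the volumes.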
docker run -it --rm --hostname 'backupname' \
-e OPTIONS="--full-if-older-than 47W --s3-use-deep-archive --no-compression --no-encryption --progress --archive-dir /mnt/backup/src/config --s3-use-multiprocessing --volsize 1024 --verbosity i --log-file /logfiles/log-$(date "+%Y%m%d-%H%M%S").log" \
-e OPTIONS_EXTRA="--asynchronous-upload --s3-european-buckets --s3-multipart-chunk-size 10 --s3-use-new-style --metadata-sync-mode partial --file-prefix-archive archive-$(hostname -f)- --file-prefix-manifest manifest-$(hostname -f)- --file-prefix-signature signature-$(hostname -f)-" \
-v /tmp/testfiles:/mnt/backup/src/data:ro \
-v "$PWD/config:/mnt/backup/src/config" \
-v "$PWD/logs:/logfiles" \
-v "$PWD/recovered:/mnt/recovered" \
docker-duplicity:latest dup \
restore \
--file-to-restore smallfiles/ \
boto3+s3://bucketname/backup-folder
/mnt/recovered
Duplicity combines files and diffs into tar archives and (by default) compresses them. It's important to ensure you have enough local storage for temporary files (from experimentation, generally at least twice the value of `--volsize`, though more is recommended). You should also ensure enough CPU, memory, and network resources are available so backups complete as quickly as possible. Run test benchmarks to confirm that performance will be adequate.
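If the container's default temp space is tight, one workaround (a sketch, not from the guide; the host path `/data/duplicity-tmp` is a placeholder) is to mount a scratch directory with enough free space and point Duplicity at it via its `--tempdir` option:

# Placeholder scratch directory on the host; pick a filesystem with free space
# of at least roughly 2x the configured --volsize.
mkdir -p /data/duplicity-tmp
df -h /data/duplicity-tmp   # sanity-check available space before backing up

# Then add to the docker run examples above:
#   -v /data/duplicity-tmp:/tmp/duplicity \
# and append to OPTIONS:
#   --tempdir /tmp/duplicity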