This is a BTRFS deduplication utility. It operates in a batch mode, scanning for files with the same size, performing an SHA256 hash on each one, then invoking the kernel deduplication ioctl for all those that match.
It is written by James Pharaoh.
It is hosted at [gitlab.wellbehavedsoftware.com](https://gitlab.wellbehavedsoftware.com/well-behaved-software/wbs-backup/tree/master/btrfs-dedupe). Please report any issues or feature requests there.
It is also available from the following locations:
- [GitHub](https://github.com/wellbehavedsoftware/wbs-backup/tree/master/btrfs-dedupe): a clone of the GitLab repository. Pull requests are welcome here, but bug reports and other issues should be tracked at GitLab (above).
- WBS Dist: contains binary packages for Ubuntu trusty and xenial.
The utility is very simple. It takes a list of directories, scans for files with matching sizes, performs an SHA256 checksum on each one, then invokes the ioctl to deduplicate the entire file for every match it finds. Optionally, it can match filenames as well as sizes; this may make the program run faster in some cases.
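The matching stage described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the tool's own code: the function names `sha256_of` and `find_duplicate_groups` are hypothetical, skipping empty files and symlinks is an assumption, and the final step of calling the kernel deduplication ioctl on each group is deliberately left out.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(roots, match_filename=False):
    """Group regular files under `roots` that share a size and a SHA-256.

    With match_filename=True the base name becomes part of the grouping
    key, which mirrors the idea behind --match-filename (hypothetical
    reimplementation, not the tool's actual logic).
    """
    # Pass 1: bucket by size (and optionally name) without reading contents.
    candidates = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
                if size == 0:
                    continue  # assumption: empty files are not worth sharing
                key = (size, name) if match_filename else (size,)
                candidates[key].append(path)

    # Pass 2: hash only the files that have at least one size match.
    groups = []
    for paths in candidates.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Each returned group is a list of paths with identical size and checksum, which is the set a deduplication request would then be issued for. Bucketing by size first means files with a unique size are never read at all, and adding the filename to the key shrinks the buckets further, which is one plausible reason the filename option can make a run faster.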
From the built-in help:
    $ btrfs-dedupe --help
    Btrfs Dedupe

    USAGE:
        btrfs-dedupe [FLAGS] [<PATH>]

    FLAGS:
        -h, --help              Prints help information
            --match-filename    Match filename as well as checksum
        -V, --version           Prints version information

    ARGS:
        <PATH>...    Root path to scan for files
There are two alternatives of which I am aware:
- Duperemove: performs a block-level hash on files and attempts to deduplicate parts of files. This is overkill for my purposes, although I have no reason to believe it does not work well. I believe it will be slower than this tool, since it does a far deeper analysis of file contents.
- Bedup: performs a similar task to this tool, and also keeps a database of files so that it can avoid checksumming them again. The main implementation, however, does not use the kernel ioctls (which were simply not available when it was created), although a branch supports them. It can also leave filesystems in an inconsistent state when errors occur, namely with files left set as immutable, and it crashes if there are many files to deduplicate.
There is also [ongoing work](http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg32862.html) to enable automatic realtime deduplication in the filesystem itself, but this is likely to take a long time to stabilise, and there are fundamental issues with the concept which make it unsuitable for many cases.
There is a wiki page with general information about the state of deduplication in BTRFS.