hashsplit: implement hashsplitting in C and make bits configurable
This improves performance (hard to measure, but I saw up to 25%
improvement on raw hashsplitting on large files); most likely by
reducing the python/C jumps.

However, the main motivation isn't so much a performance
*improvement* as the ability to adjust the number of low bits that
need to be 1 for a split to occur. Increasing this (decreasing is
currently not supported and not very useful) would lead to a larger
average blob size, and thus to fewer blobs (assuming files mostly
aren't small enough to not get split).
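
As a rough illustration of the bits setting described above (a sketch under stated assumptions, not the C code from this commit): a split occurs when the low `bits` bits of the rolling checksum are all 1. If the checksum is roughly uniform, the chance of a split at any given byte is about 2^-bits, so the average blob is about 2^bits bytes. The function names and the default of 13 (the low end of the supported range) are hypothetical choices for this example.

```python
def is_split_point(rollsum: int, bits: int = 13) -> bool:
    """Return True if the low `bits` bits of `rollsum` are all 1.

    Hypothetical helper for illustration; the default of 13 is an
    assumption taken from the low end of the supported range.
    """
    mask = (1 << bits) - 1
    return (rollsum & mask) == mask


def expected_blob_size(bits: int) -> int:
    """Approximate average blob size in bytes.

    Assumes the checksum bits are uniformly distributed, so a split
    happens at any given byte with probability 2**-bits.
    """
    return 1 << bits


if __name__ == "__main__":
    # Each extra bit roughly doubles the average blob size.
    for bits in (13, 16, 21):
        print(bits, expected_blob_size(bits))
```

Each additional bit halves the probability of a split at any given position, which is why raising the value leads to fewer, larger blobs.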

This new variation isn't actually used yet, but we can now pass
the number of bits to the new HashSplitter class. That was largely
the point: I didn't want to pass the number from python to C for
every blob, as that would likely have hurt performance
significantly. Having the class in C allows setting this up just
once.

Note that the hashsplit algorithm fanout quirk is kept even in the
higher bits cases, in order to be compatible with repos created
with modified versions of bup (with different BUP_BLOBBITS).

The new HashSplitter API now embodies most of the old python
code (which is removed), and is used as

 hs = HashSplitter(<iterable of files>,
                   progress=<callable or None>,
                   bits=<number of bits for split, 13 .. 21>,
                   keep_boundaries=<truth value>,
                   fanbits=<bits in each level>)

Where all parameters except for the file list are optional,
and the "files" really only need to properly implement the
"read(max_size)" method, but things are optimised for real
files that provide their fd via "fileno()".
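
To make the "files only need read(max_size)" point concrete, here is a minimal sketch of such an object. `BytesSource` and `drain` are hypothetical names invented for this example and are not part of bup. The object deliberately has no "fileno()", so it would take the generic path rather than the one optimised for real files.

```python
import io


class BytesSource:
    """Hypothetical minimal 'file': only read(max_size) is provided."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def read(self, max_size: int) -> bytes:
        # Return at most max_size bytes; b"" signals end of input.
        return self._buf.read(max_size)


def drain(source, max_size: int = 4):
    """Read a source to exhaustion using only read(max_size)."""
    chunks = []
    while True:
        chunk = source.read(max_size)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    print(drain(BytesSource(b"abcdefghij")))
    # Assumed usage with bup installed (not executed here), based on
    # the signature quoted in this commit message:
    #   hs = HashSplitter([BytesSource(data)], bits=13)
```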

Note also that I removed a number of tests that were testing
parts of the old code that are no longer standalone, but I added
a number of tests for testing various bits. Also, since it's all
in C now, the tests can no longer override _helpers.splitbuf(),
so I added a few randomly created test blobs that result in the
desired number of split bits for various tests.

Finally, for testing purposes, expose the raw algorithm as the
"rollsum()" function, but remove lots of other unused code.

Signed-off-by: Johannes Berg <[email protected]>
Reviewed-by: Rob Browning <[email protected]>
[[email protected]: drop python 2 support; adjust mmap error check;
 return rollsum to py via unsigned long, and check overflow]
Signed-off-by: Rob Browning <[email protected]>
Tested-by: Rob Browning <[email protected]>
jmberg authored and rlbdv committed Nov 20, 2022
1 parent 8cc0615 commit 4ec5c15
Showing 9 changed files with 793 additions and 517 deletions.
2 changes: 1 addition & 1 deletion GNUmakefile
@@ -197,7 +197,7 @@ lib/cmd/bup: lib/cmd/bup.c src/bup/compat.c src/bup/io.c

 clean_paths += lib/bup/_helpers$(soext)
 generated_dependencies += lib/bup/_helpers.d
-lib/bup/_helpers$(soext): lib/bup/_helpers.c lib/bup/bupsplit.c
+lib/bup/_helpers$(soext): lib/bup/_helpers.c lib/bup/bupsplit.c lib/bup/_hashsplit.c
 	$(CC) $(helpers_cflags) $(CPPFLAGS) $(CFLAGS) $^ \
 	  $(helpers_ldflags) $(LDFLAGS) -o $@
