Unpredictable compression speed using "--optimal" level #11
Firstly, thank you for such a well written and detailed report! I am afraid this behavior of the optimal level is expected.

The usual way to avoid such behavior is to limit the match search in some way (for instance, stop when a certain number of potential matches have been checked, or when a match of acceptable length has been found -- this is what level 9 does). But in order to find the bit-optimal compressed representation for the BriefLZ format, it has to consider all possible matches.

I would suggest sticking to levels 1-9 with the current version of BriefLZ if your data may trigger this. The function description does mention that the optimal level can be very slow depending on the type of data.
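To make the difference concrete, here is a rough sketch in C (not the actual BriefLZ code -- the names chain, max_depth and accept_len are made up for illustration) of a bounded match search of the kind the numbered levels can use, and why an optimal parse has no such shortcut:

```c
/* Illustrative sketch only -- not the actual BriefLZ source. It shows the
 * kind of cut-off a bounded level can use, and why an optimal parse cannot.
 * chain[], max_depth and accept_len are hypothetical names. */
#include <stddef.h>

static size_t match_length(const unsigned char *buf, size_t a, size_t b, size_t end)
{
    size_t len = 0;
    while (b + len < end && buf[a + len] == buf[b + len]) {
        ++len;
    }
    return len;
}

/* Bounded search: levels 1-9 can stop early, so the worst case is capped. */
static size_t find_match_bounded(const unsigned char *buf, size_t pos, size_t end,
                                 const size_t *chain, size_t max_depth,
                                 size_t accept_len, size_t *best_pos)
{
    size_t best_len = 0;
    size_t cand = chain[pos];               /* most recent candidate position */

    for (size_t depth = 0; depth < max_depth && cand != (size_t)-1; ++depth) {
        size_t len = match_length(buf, cand, pos, end);
        if (len > best_len) {
            best_len = len;
            *best_pos = cand;
            if (best_len >= accept_len) {
                break;                      /* "good enough" -- stop searching */
            }
        }
        cand = chain[cand];                 /* follow chain to older candidate */
    }
    return best_len;
}

/* An optimal parse has no such escape hatch: to guarantee the bit-optimal
 * encoding it must consider every candidate on the chain, so highly
 * repetitive data (very long chains) can make it dramatically slower. */
```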
Very glad to see BriefLZ performing in Kirill's SCB, just would like to see the effectiveness in action, therefore Joergen, please provide the options you wanna see.
And one quick console dump:
As for the 'SILVA' ~1GB DNA:
On i5-7200u 2.7GHz my old (a year ago) test shows:
So, it would be nice to see the best of BriefLZ juxtaposed to other performers, especially to my toy.
It's my pleasure, and thank you for reply!
Thank you for the explanation. (Maybe this could be worth mentioning in the readme?)
The problem is that I did not know that my data would trigger it. Testing with smaller data did not reveal it, until I tried a 3 GB dataset. I'm afraid someone else might experience the same situation, and not have the patience (or time budget) to wait for compression to finish.
I'm afraid it would not be enough. It's technically and strictly speaking correct to say "very slow" and "very slow depending on the type of data", but it will still not enable the user to build an accurate model of what to expect. The slowdown is ~100 times compared to its normal performance on other data. From the viewpoint of the user, it's indistinguishable from blzpack being frozen. If I experienced this slowdown using blzpack for actual data compression (rather than benchmarking), I would certainly kill the process (rather than waiting 27 hours) and conclude that it's broken. (And reconsider whether I can rely on this broken compressor at all, even using other levels.)

I would recommend clearly explaining this possible slowdown in a way that's hard to miss. E.g. something like: "Important note about --optimal mode: Please use it only for testing, and never in production. Its compression speed is unpredictable and can get 100 times slower depending on the data. It should only be used with an unlimited time budget."

I fully understand being curious about the ultimate limit of the compression ratio, hence the --optimal mode. I just think the user normally will not expect the slowdown, and a clear warning about it would help.
In addition, here is a chart comparing the compression speed of several compressors: I included only general-purpose compressors, and all datasets. Only the strongest level (single thread) of each compressor is shown. You can highlight each compressor by clicking on its marker in the legend. This may help to compare the variation in compression speed of BriefLZ with that of other popular compressors.
Your SCB is the best in many regards/departments; it serves quite well in estimating the effectiveness in real scenarios. These old results of Nakamichi are bettered by the 2020-May-09 revision. Personally, I would love to see a clash/showdown of Lizard 49 vs Nakamichi vs BriefLZ (Joergen's favorite options) vs pigz-11-4t, with: Also, when decompressing from non-remote, i.e. local, storage (an SSD on SATA 3, i.e. link speed 6000 Mbit/s, for estimating transfer time):
Thanks, Georgi. A quick note about the BriefLZ decompression speed results in these charts. Currently it is handicapped by not supporting streaming output, i.e., blzpack always outputs to a file (as far as I understand). However, in my benchmark I always measure streaming decompression speed. Therefore, for compressors without such a streaming mode (such as blzpack), I let them decompress to a file, then stream this file, and measure the total time. If blzpack gets support for a streaming mode, its decompression speed results will improve. (This happened with Nakamichi after Georgi added such a streaming mode recently.)

Streaming mode is important for many practical applications where you don't want to store the decompressed data in a file, but instead process it on the fly with another tool, as part of a data analysis pipeline. (I also use only streaming mode during compression, but compression is much slower, so the impact is comparatively smaller for blzpack. However, its compression speed results should also improve with the addition of a streaming mode.) (Note that only uncompressed data is streamed in my benchmark. For compressed data it's OK to work with a file directly.)
Thank you for the comments both.
The example program blzpack is limited to a block size of roughly 3.6 GiB. The compression library itself is limited by the size of the type used to store offsets and lengths, which is 32 bits by default.

The optimal level uses 5 * block size words of work memory, which for 32-bit values is 20N bytes. Add to that the current input and output blocks, which are also held in memory, and you get 22N. For reference (in 1.3.0), levels 1-4 use a fixed amount of workmem, levels 5-7 use 12N bytes of workmem, and levels 8-10 use 20N bytes of workmem.
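As a back-of-the-envelope check of these numbers (my own arithmetic, not figures taken from the BriefLZ source), here is what 22N works out to for the maximum ~3.6 GiB block:

```c
/* Rough memory estimate for --optimal, following the 22N figure above.
 * The 3.6 GiB maximum block size is taken from the comment; everything
 * else is plain arithmetic, not values read from the BriefLZ source. */
#include <stdio.h>

int main(void)
{
    double n = 3.6 * 1024.0 * 1024.0 * 1024.0;   /* block size N in bytes */

    double workmem = 5.0 * n * 4.0;  /* 5 * N words of 4 bytes = 20N */
    double blocks  = 2.0 * n;        /* input block + output block, roughly 2N */
    double total   = workmem + blocks;           /* ~22N */

    printf("workmem : %.1f GiB\n", workmem / (1024.0 * 1024.0 * 1024.0));
    printf("total   : %.1f GiB\n", total / (1024.0 * 1024.0 * 1024.0));
    /* prints roughly 72 GiB of workmem and 79 GiB in total -- the "~80 GiB
     * of RAM" mentioned further down the thread. */
    return 0;
}
```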
Usually not. Lower compression levels limit the search depth (at level 1 only the most recent match is checked), so using larger block sizes may have less of an impact than on higher levels. That being said, you could of course construct inputs where it would matter hugely, like 1GiB of random data repeated. Playing around with some genome data today, I must admit sometimes it gives surprising results compared to other types of data.
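As an illustration of that "random data repeated" case, a hypothetical little generator like the one below (not part of the repository) produces input where the block size matters hugely: a block large enough to span both copies lets the second copy be encoded as matches against the first, while blocks the size of a single copy see only incompressible random data.

```c
/* Hypothetical helper (not part of BriefLZ): write SIZE bytes of pseudo-random
 * data twice to stdout. Compressed with a block size >= 2*SIZE, the second copy
 * is one long run of matches; with a block size of SIZE, each block looks like
 * incompressible random data. */
#include <stdio.h>
#include <stdlib.h>

#define SIZE (64UL * 1024 * 1024)   /* 64 MiB per copy; scale up to reproduce the 1 GiB example */

int main(void)
{
    unsigned char *buf = malloc(SIZE);
    if (buf == NULL) {
        return 1;
    }
    srand(12345);
    for (unsigned long i = 0; i < SIZE; ++i) {
        buf[i] = (unsigned char)(rand() & 0xFF);
    }
    fwrite(buf, 1, SIZE, stdout);   /* first copy */
    fwrite(buf, 1, SIZE, stdout);   /* identical second copy */
    free(buf);
    return 0;
}
```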
If you have the ~80 GiB of RAM it would take, then that should be possible. From a few quick tests, it does not appear that the block size has a huge effect on the compression ratio on the genome data.
That is a good point, I will try to come up with a suitable warning.
If by streaming you mean the ability to read blocks from stdin and write blocks to stdout, then that would be possible to add.
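For illustration, here is a minimal sketch of such a stdin-to-stdout loop built on the library API. It is an assumption on my part, not a proposed patch: it does not reproduce the real blzpack container format (no block headers are written), and the brieflz.h calls (blz_max_packed_size, blz_workmem_size_level, blz_pack_level) are used as I understand their signatures.

```c
/* Sketch of a stdin -> stdout block-compressing loop, to illustrate the
 * streaming mode discussed above. NOT the blzpack container format, and the
 * brieflz.h calls are used as I understand them -- treat as an assumption. */
#include <stdio.h>
#include <stdlib.h>

#include "brieflz.h"

#define BLOCK_SIZE (1024UL * 1024UL)

int main(void)
{
    unsigned char *in  = malloc(BLOCK_SIZE);
    unsigned char *out = malloc(blz_max_packed_size(BLOCK_SIZE));
    void *workmem      = malloc(blz_workmem_size_level(BLOCK_SIZE, 9));
    size_t n;

    if (in == NULL || out == NULL || workmem == NULL) {
        return 1;
    }

    /* Read a block from stdin, compress it, write the result to stdout. */
    while ((n = fread(in, 1, BLOCK_SIZE, stdin)) > 0) {
        unsigned long packed = blz_pack_level(in, out, (unsigned long)n, workmem, 9);
        /* A real container would also store n and packed here so the
         * decompressor knows the block boundaries. */
        if (fwrite(out, 1, packed, stdout) != packed) {
            return 1;
        }
    }

    free(workmem);
    free(out);
    free(in);
    return 0;
}
```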
I'll try to add these settings to the benchmark in the future.
Thanks.
Yes, this is exactly what I mean, sorry that it was unclear. BriefLZ decompression is fast enough for this to matter in my tests, so this would be a very welcome addition.
Joergen, you are welcome to see my latest post (in here my old web browser prevents me from posting attachments and pictures). Looking forward to seeing how BriefLZ performs with bigger blocks; for now, I included the default use:
Regarding streaming, if you are on Linux, you could try simply specifying stdin and stdout as filenames, like: ./blzpack -9 /dev/stdin /dev/stdout <foo >foo.blz
BriefLZ compression speed with "--optimal" varies wildly depending on input data.
I used "blzpack --optimal" to compress several biological datasets. Most of the time it compresses at a speed of about 1 to 3 MB/s (sometimes 6). However, when compressing the human genome, its speed dropped to 33 kB/s. I tried it twice; both times it took ~27 hours to complete.
In case this is normal and expected behaviour, I think it should be documented, so that users know the risks of using the "--optimal" mode.
blzpack seems to work fine with other settings. I used BriefLZ 1.3.0, commit 0ab07a5, built using the instructions from the readme. I have used only the default block size so far, though I plan to test it with other block sizes too.
Test machine: Ubuntu 18.04.1 LTS, dual Xeon E5-2643v3, 128 GB RAM, no other tasks running.
This test was a part of Sequence Compression Benchmark: http://kirr.dyndns.org/sequence-compression-benchmark/ - this website includes all test data, commands, and measurements.
In particular, all test data is available at: http://kirr.dyndns.org/sequence-compression-benchmark/?page=Datasets
Compression and decompression speed of "blzpack --optimal" on all datasets:
Please let me know if you need any other details or help reproducing this issue.