forked from fast-pack/FastPFor
-
Notifications
You must be signed in to change notification settings - Fork 0
The FastPFOR C++ library: Fast integer compression
License
jvictor0/FastPFor
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The FastPFOR C++ library : Fast integer compression Daniel Lemire, Leonid Boytsov, Owen Kaser and Maxime Caron == What is this? == A research library with integer compression schemes. It should be suitable for applications to d-gap compression and delta coding in general. == License == APL 2.0. == Warning == This code requires a (recent as of 2012) compiler supporting C++11. This was a design decision. You can specify which C++ compiler you are using with the YOURCXX variable. e.g., under bash type export YOURCXX=g++-4.7 Below is a description of how to install GCC 4.7: ==== GCC 4.7 under Linux ==== Under Linux, installing GCC 4.7 is not difficult. For example, for Linux Mint you need to edit two files to include "debian experimental": http://community.linuxmint.com/tutorial/view/250 then type sudo apt-get install gcc-4.7 g++-4.7 ==== GCC 4.7 under Mac ==== Mac Ports supports gcc 4.7. http://www.macports.org/ == Why C++11? == With minor changes, all schemes will compile fine under compilers that do not support C++11. And porting the code to C should not be a challenge. == What if I prefer Java? == Many schemes cannot be efficiently ported to Java. However some have been. Please see: https://github.com/lemire/JavaFastPFOR == Requirements == You need GNU GCC 4.7 or better. The code was tested under Linux and MacOS. == What about LLVM/clang++? == Clang 3.1 very nearly can compile everything, but its support for C++11 is still insufficient. We hope that the next release of Clang (3.2?) will correct these limitations. == Building and testing == make ./unit Note that we are thorough in the unit tests so it can take several minutes to run through them all. == Simple benchmark == make codecs ./codecs --clusterdynamic ./codecs --uniformdynamic == Processing data files == Typing make will generate an inmemorybenchmark executable that can process data files. You can use it to process arrays on (sorted!) integers on disk using the following 32-bit format: 1 unsigned 32-bit integer indicating array length followed by the corresponding number of 32-bit integer. Repeat. ( It is assumed that the integers are sorted.) Once you have such a binary file somefilename you can process it with our inmemorybenchmark: ./inmemorybenchmark --minlength 10000 somefilename The "minlength" flag skips short arrays. (Warning: timings over short arrays are unreliable.) == Testing with the ClueWeb09 data set == Grab the data set from: http://boytsov.info/datasets/clueweb09gap/ Using the provided software, uncompress it and run the "toflat" executable to create a "clueweb09.bin" file then run: ./inmemorybenchmark --minlength 10000 clueweb09.bin Note: processing can take over an hour. == Testing with the Gov2 data set == You can test the library over d-gaps data from the TREC GOV2 data set that was made graciously available by Fabrizio Silvestri, Takeshi Yamamuro and Rossano Venturini. Go to: http://integerencoding.isti.cnr.it/?page_id=8 Download both the software and the gov.sort.tar file. Untar the tar file: $ tar xvf gov2.sort.tar You may want to make the newly generated files non-writeable (I'm paranoid): $ chmod -w gov2.sort.Delta gov2.sort.Delta.TOC Build the software (you need the decoders executable) and You need to run this command ./decoders 3 gov2.sort somefilename where "3" is for delta. Then you should be able to test with out inmemorybenchmark: ./inmemorybenchmark --minlength 10000 somefilename.DEC Note: processing can take over an hour. == I used your code and I get segmentation faults == Our code is thoroughly tested. One common issue is that people do not provide large enough buffers. Some schemes can have such small compression rates that the compressed data generated will be much larger than the input data. Also, make sure that all provided buffers are 16-byte aligned or else, some SSE instructions may fail. == Is any of this code subject to patents? == I (D. Lemire) did not patent anything. However, we implemented varint-G8UI which was patented by its authors.
About
The FastPFOR C++ library: Fast integer compression
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published