Skip to content

The FastPFOR C++ library: Fast integer compression

Notifications You must be signed in to change notification settings

seanjensengrey/FastPFor

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

The FastPFOR C++ library : Fast integer compression
Daniel Lemire, Leonid Boytsov, Owen Kaser and Maxime Caron


== What is this? ==

A research library with integer compression schemes.
It should be suitable for applications to d-gap
compression and delta coding in general.


== License ==

APL 2.0.

== Warning ==

This code requires a (recent as of 2012) compiler supporting
C++11. This was a design decision. You can specify which
C++ compiler you are using with the YOURCXX variable.

e.g., under bash type

export YOURCXX=g++-4.7

Below is a description of how to install GCC 4.7:

==== GCC 4.7 under Linux ====


Under Linux, installing GCC 4.7 is not difficult. For example,
for Linux Mint you need to edit two files to include "debian
experimental":

http://community.linuxmint.com/tutorial/view/250

then type 

sudo apt-get install gcc-4.7 g++-4.7

==== GCC 4.7 under Mac ====

Mac Ports supports gcc 4.7.

http://www.macports.org/

== Why C++11? ==

With minor changes, all schemes will compile fine under
compilers that do not support C++11. And porting the code
to C should not be a challenge.

== What if I prefer Java? ==

Many schemes cannot be efficiently ported to Java. However
some have been. Please see:

https://github.com/lemire/JavaFastPFOR

== Requirements ==

You need GNU GCC 4.7 or better. The code
was tested under Linux and MacOS. 

== What about LLVM/clang++? ==

Clang 3.1 very nearly can compile everything, but its support
for C++11 is still insufficient. We hope that the next release
of Clang (3.2?) will correct these limitations.


== Building and testing ==

make
./unit

Note that we are thorough in the unit tests so it can
take several minutes to run through them all.

== Simple benchmark ==

make codecs
./codecs --clusterdynamic
./codecs --uniformdynamic

== Processing data files ==

Typing make will generate an inmemorybenchmark 
executable that can process data files. 

You can use it to process arrays on (sorted!) integers
on disk using the following 32-bit format: 1 unsigned 32-bit
integer  indicating array length followed by the corresponding
number of 32-bit integer. Repeat.
 
 ( It is assumed that the integers are sorted.)


Once you have such a binary file somefilename you can
process it with our inmemorybenchmark:

./inmemorybenchmark --minlength 10000 somefilename

The "minlength" flag skips short arrays. (Warning: timings over
short arrays are unreliable.)


== Testing with the ClueWeb09 data set ==

Grab the data set from:

 http://boytsov.info/datasets/clueweb09gap/

Using the provided software, uncompress it and
run the "toflat" executable to create a "clueweb09.bin" file
then run:

./inmemorybenchmark --minlength 10000 clueweb09.bin

Note: processing can take over an hour.

== Testing with the Gov2 data set ==
 
You can test the library over d-gaps data
from the TREC GOV2 data set that was made graciously
available by   Fabrizio Silvestri,  Takeshi Yamamuro
and Rossano Venturini.

Go to:

http://integerencoding.isti.cnr.it/?page_id=8

Download both the software and the gov.sort.tar file.

Untar the tar file:

$ tar xvf gov2.sort.tar

You may want to make the newly generated files non-writeable
(I'm paranoid):

 $ chmod -w gov2.sort.Delta gov2.sort.Delta.TOC 

Build the software (you need the decoders executable)
and 

You need to run this command

./decoders 3 gov2.sort somefilename

where "3" is for delta.

Then you should be able to test with out inmemorybenchmark:

./inmemorybenchmark --minlength 10000 somefilename.DEC

Note: processing can take over an hour.



== I used your code and I get segmentation faults  ==

Our code is thoroughly tested.

One common issue is that people do not provide large enough buffers.
Some schemes can have such small compression rates that the compressed data 
generated will be much larger than the input data.

Also, make sure that all provided buffers are 16-byte aligned or else,
some SSE instructions may fail.

== Is any of this code subject to patents?  ==

I (D. Lemire) did not
patent anything. 

However, we implemented
varint-G8UI which was patented by its authors. 

About

The FastPFOR C++ library: Fast integer compression

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published