This is a C++ template library for succinct data structures called sdsl.
Succinct data structures are fascinating: They represent an object (like a bitvector, a tree, suffix array,...) in space close the information-theoretic lower bound of the object but the defined operations can still be performed efficiently. Hmmm, at least in theory ;) Actually there is still a big gap between theory and practice. Why? The time complexity of an operations performed on the classical fat data structure and the slim succinct data structure are the same most time in theory. However, in practice succinct structures are slow since the operations require often memory accesses with bad locality of references. Moreover, often the in theory small sub-linear space data structures account for a large amount of memory, since they are only asymptotic sub-linear and the input size for which they are negligible in practice is galactic.
The aim of the library is to provide basic and complex succinct data structure which are
- easy to use (like the STL, which provides classical data structures),
- capable of handling large inputs (yes, we support 64-bit),
- provide excellent performance in construction, and
- provide excellent operation performance
We used several techniques to reach the performance goals:
- bit-parallelism on 64-bit word
- cache-friendly data structure layouts
- semi-external construction algorithms
- the built-in popcount operation
- hugepages to bypass the TLB bottleneck
Details are in our comprehensive experimental study.
- Bitvectors
- An uncompressed mutual bitvector (
bit_vector
) - An uncompressed immutable bitvector (bit_vector_il)
- A -compressed immutable bitvector (rrr_vector<>)
- A bitvector for sparse populated arrays (sd_vector<>)
- An uncompressed mutual bitvector (
- Rank Support (RS) and Select Support (SS)
- Several rank and select implementations with different time-space trade-offs for the uncompressed bitvectors (rank_support_v, rank_support_v5, select_support_mcl, ...)
- Rank and select for compressed bitvectors (rank_support_rrr, rank_support_sd, ...)
- Variable-length Coders
- Elias- coder (coder::elias_delta)
- Fibonacci-coder (coder::fibonacci)
- Integer Vectors
- Mutable vectors for (compile-time) fixed
w
-bit integers (int_vector) - Mutable vector for (run-time) fixed
w
-bit integers (int_vector<0>,w
passed to the constructor) - Immutable compressed integer vectors using a variable-length coder
coder
(enc_vector, vlc_vector)
- Mutable vectors for (compile-time) fixed
- Wavelet Trees (WT) (all immutable)
- Compressed Suffix Arrays (CSA) (all immutable)
- csa_bitcompressed is based on the bitcompressed SA and inverse SA.
- csa_wt is based on a WT of the BWT.
- csa_sada is based on the compressed -function
- Balanced Parentheses Support (BPS) (all immutable)
- A range-min-max-tree implementation (bp_support_sada)
to support operations
find_open
,find_close
,enclose
,double_enclose
,... - Hierarchical solutions with pioneer parentheses (bp_support_g, bp_support_gg)
- A range-min-max-tree implementation (bp_support_sada)
to support operations
- Longest Common Prefix (LCP) Arrays (all immutable)
- lcp_bitcompressed is a bitcompressed version
- lcp_byte encodes small values with one byte and large values with two words
- lcp_dac used direct accessible codes
- lcp_wt stores small values in a WT and large value in on word.
- lcp_vlc stores the values in a
vlc_vector
. - lcp_support_sada uses a bitvector of 2n bits, a select structure supporting it, and the corresponding CSA.
- lcp_support_tree uses the topology of the corresponding CST.
- lcp_support_tree2 uses the corresponding CSA and CST.
- Compressed Suffix Trees (CST) (all immutable)
- Range Minimum/Maximum Query (RMQ) Structures (all immutable)
- Self-contained RMQ structure using 2n+o(n) bits or 4n+o(n) bits (rmq_succinct_sct, rmq_succinct_sada)
- Non-succinct support structure for RMQ (rmq_support_sparse_table)
Let us now show how you can assemble even a very
complex data structure very easily. Lets begin with
the most complex one, a CST!
It basically consists of a CSA, an compressed LCP-array,
and a succinct representation of the tree topology;
each part can be specified by a template parameter.
Say, we want fast navigation operations, so we take
the class cst_sada<cst_type, lcp_type, bp_support_type>
for our CST. Now we can specify the type of CSA.
Lets take a CSA based on wavelet tree:
csa_wt<wt_type, SA_sample_dens, inv_SA_sample_dens>
.
We can recursively specify the used types. So
now we can specify the used wavelet tree, say
a run-length compressed wavelet tree
(wt_rlmn<>
). We could recurse again and specify, each detail
of the wavelet tree (e.g. which rank support structure
should be used) but we stick now with the default
configuration which uses an sd_vector
for the
marking of the heads of the runs in the wavelet tree.
Lets choose at last a LCP array which uses
the topology of the CST and the CSA to
compress the LCP values (lcp_support_tree2
) and
stick with default template parameters for all
types. So the final type looks like this:
cst_sada<cst_wt<wt_rlmn<> >, lcp_support_tree2<> >
.
Now, lets explore the data structure a little bit. We take the english.100MB input from the Pizza&Chili-corpus, construct the CST-object, output its structure, and visualise it using the d3js-library. Have fun with the result.
The data structures in the library can be divided into several classes:
- Objects of mutable classes can be changed after construction (e.g.
we can assign new values to the elements of an
int_vector
) - Objects of immutable classes can not be changed after construction
(e.g. you can not assign a new value to an element of a
compressed suffix array, say
csa_wt
) - Objects of support classes add functionality to objects of
self-contained classes. For example an object of type
rank_support_v
addes constant timerank(i)
-functionality to an object of typebit_vector
, or an object of of typebp_support_sada
addsfind_open(i)
-functionality to abit_vector
object, which represents a balanced parentheses sequence.
Each sdsl-class X
has to implement the following methods:
- The standard constructor
X()
- The copy constructor
X(const &X)
- Swap operator
swap(const &X)
- serialize operator
serialize(std::ostream &out, structure_tree_node* v, std::string name)
- load operator
load(std::istream &in)
We provide many handy methods for sdsl objects:
store_to_file(const T &o, std::string file)
stores objecto
tofile
load_from_file(T &o, std::string file)
loads objecto
fromfile
size_in_bytes(const T &o)
returns the number of bytes needed to represent objecto
in memory.write_structure<FORMAT>(const T &o, std::ostream &out)
writes the structure of objecto
in JSON or R format (FORMAT
=JSON_FORMAT
orR_FORMAT
) intoout
util::clear(T &o)
frees space by setting o to the empty object.- ...for more have a look into the cheat sheet in
extras/cheatsheet
.
The library was successfully tested on the following configurations
- Mac OS X 10.7.3 on a MacBookPro equipped with a Intel Core i5 CPU
- Ubuntu Linux 12.04 running on a server equipped with Intel Xeon (E5640) CPUs
The installation requires that the cmake tool
and a C++ compiler (e.g. from the GNU Compiler Collection)
is installed.
You can than install the library into an directory SDSL_INSTALL_DIR
by
calling
./install SDSL_INSTALL_DIR
.
If SDSL_INSTALL_DIR
is not specified your home directory is used.
The library header files will be located in the directory
SDSL_INSTALL_DIR/include
and the library in the directory
SDSL_INSTALL_DIR/lib
. After the installation you can
execute the tests in the test
directory or start
with some code examples in the examples
folder.
A cheat sheet can be generated by running make in the
extras/cheatsheet
folder.
The test directory contains test code for many library structures. We use googletest framework and make to run the tests. See the README file in the directory for details.
You can find out how efficient the library works on your system by running experiments in the benchmark directory.
Compile the examples with make
and experience
how easy it is to use succinct data structures.
We have included the code of two excellent suffix array construction algorithms.
- Yuta Mori's incredible fast suffix libdivsufsort algorithm (version 2.0.1) for byte-alphabets.
- An adapted version of Jesper Larsson's implementation of the algorithm of Larson and Sadakane for integer-alphabets.
This project profits from excellent input of many coders. Timo Beller improved the construction process during the last month and Matthias Petri contributed new bitvectors and helped a lot in making the library more accessible. Stefan Arnold helped us with tricky template questions. We a grateful to Kalle Karhu, Dominik Kempa, and Shanika Kuruppu for bug reports.
The header file of each data structure should contain all relevant references for the data structure. If you find a header file, where a reference is missing, please contact Simon.
The latest paper about the library itself is:
- Simon Gog, Matthias Petri: Optimized Succinct Data Structures for Massive Data, Accepted for publication in Software, Practice and Experience.