Skip to content

String map implementation through Fast Succinct Trie

License

Notifications You must be signed in to change notification settings

nonego/fast_succinct_trie

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fast_succinct_trie

This library provides a string map through Fast Succinct Trie (FST), proposed in SIGMOD 2018. The library is implemented by modifying the original FST implementation efficient/SuRF and applying a compact minimal-prefix trie form to simulate a trie-based string map (c.f. Section 2.2 of KAIS 2017).

What is FST?

FST is a succinct trie data structure proposed in the paper,

Zhang, Lim, Leis, Andersen, Kaminsky, Keeton and Pavlo: SuRF: Practical Range Query Filtering with Fast Succinct Trie, In SIGMOD 2018, pp. 323-336.

Briefly, FST is a practical variant of LOUDS-trie. FST uses two LOUDS implementations: one is fast and the other is space-efficient. FST partitions a trie into two layers at a level and applies the fast one to the top layer and the space-efficient one to the bottom layer. More specific explanations can be found in the slide by the author.

Since FST was developed for succinct range query filtering, the original implementation allows us to include false positives in query solutions. This library modifies it and provides a string map based on FST.

Install

This library consists of only header files. Please through the path to the directory include.

Build instructions

You can download and compile this library as the following commands:

$ git clone https://github.com/kampersanda/fast_succinct_trie.git
$ cd fast_succinct_trie
$ mkdir build
$ cd build
$ cmake ..
$ make -j

Sample usage

#include <fstream>
#include <iostream>

#include <fst.hpp>

int main() {
    std::vector<std::string> keys = {
        "ACML",  "AISTATS", "DS",    "DSAA",   "ICDM",   "ICML",  //
        "PAKDD", "SDM",     "SIGIR", "SIGKDD", "SIGMOD",
    };

    // a trie-index constructed from string keys sorted
    fst::Trie trie(keys);

    // keys are mapped to unique integers in the range [0,#keys)
    std::cout << "[searching]" << std::endl;
    for (size_t i = 0; i < keys.size(); ++i) {
        fst::position_t key_id = trie.exactSearch(keys[i]);
        std::cout << " - " << keys[i] << ": " << key_id << std::endl;
    }

    std::cout << "[statistics]" << std::endl;
    std::cout << " - number of keys: " << trie.getNumKeys() << std::endl;
    std::cout << " - number of nodes: " << trie.getNumNodes() << std::endl;
    std::cout << " - number of suffix bytes: " << trie.getSuffixBytes() << std::endl;
    std::cout << " - memory usage in bytes: " << trie.getMemoryUsage() << std::endl;
    std::cout << " - output file size in bytes: " << trie.getSizeIO() << std::endl;

    std::cout << "[configure]" << std::endl;
    trie.debugPrint(std::cout);

    // write the trie-index to a file
    {
        std::ofstream ofs("fst.idx");
        trie.save(ofs);
    }

    // read the trie-index from a file
    {
        fst::Trie other;
        std::ifstream ifs("fst.idx");
        other.load(ifs);
    }

    std::remove("fst.idx");
    return 0;
}

The output will be

[searching]
 - ACML: 1
 - AISTATS: 2
 - DS: 4
 - DSAA: 5
 - ICDM: 6
 - ICML: 7
 - PAKDD: 0
 - SDM: 3
 - SIGIR: 8
 - SIGKDD: 9
 - SIGMOD: 10
[statistics]
 - number of keys: 11
 - number of nodes: 19
 - number of suffix bytes: 24
 - memory usage in bytes: 587
 - output file size in bytes: 312
[configure]
-- LoudsDense (heigth=1) --
LABEL: A D I P S | 
CHILD: 1 1 1 0 1 | 
PREFX: 0         |
-- LoudsSparse --
LABEL: C I S C D I ? A D M G I K M ? 
CHILD: 0 0 1 1 0 1 0 0 0 0 1 0 0 0 
LOUDS: 1 0 1 1 1 0 1 0 1 0 1 1 0 0 
-- Suffixes --
POINTERS: 17 11 1 9 0 22 9 12 7 19 14 
SUFFIXES: ? S T A T S ? R ? M ? M L ? O D ? A K D D ? A ?  

Todo

  • Support more operations
  • Implement a normal trie form also

Licensing

This library is free software provided under Apache License 2.0, following the License of efficient/SuRF. The modifications are shown in each source file.

About

String map implementation through Fast Succinct Trie

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 98.2%
  • Python 1.1%
  • CMake 0.7%