A Fast and Memory-Efficient HAT-Trie Implementation Based on Tessil hat-trie (NIF based). Learn more about the HAT-Trie data structure here.
- The
etrie
data structure reference should be operated from the same process where it was created to avoid concurrent operations. - An
etrie
reference is garbage-collected by the Erlang VM when there is no longer any reference to it. References can be stored, passed between processes on the same node, and stored in an ETS table. - Methods such as
new/1
,from_list/1
,from_list/2
,to_list/1
are implemented as dirty NIFs to handle large lists efficiently and prevent Erlang scheduler blocking. etrie
stores the HAT-Trie values of types such as atoms, numeric, or binary strings as native data (uint64
,int64
, ordouble
for numeric data,uint64
for atoms andstd::string
for binary strings). The other data types like lists, tuples, etc., the storage is done by encoding the value usingerlang:term_to_binary
, which may contribute to additional CPU usage and memory consumption. It's essential to note that beside the data type itself there is some other additional overhead introduced by the wrappers and the use ofstd::variant
(a C++ type-safeunion
) but should be insignificant.
- The library provides an efficient and compact way to store a map of strings by compressing common prefixes. It also allows searching for keys that match a prefix.
- The default parameters are optimized for exact searches. If you perform many prefix searches, consider adjusting the burst threshold
(default to 16384)
through the*_burst_threshold
methods or during structure creation. - Keys must be binary strings, and the maximum key size is limited to
65534
bytes. - The capacity of the HAT-Trie is restricted to a maximum of
4294967296
items. - Due to the limitations of the NIF API, references returned from other NIFs are encoded using
term_to_binary/1
. If you do not store a reference to these externally, there is a possibility that they may be subject to garbage collection. It is important to note that references toetrie
objects are appropriately managed.
Ensure you have at least Erlang OTP 19.0, as enif_binary_to_term
and enif_term_to_binary
are not available in previous versions.
Compile:
rebar3 compile
Simple usage:
{ok, T} = etrie:new(),
ok = etrie:insert(T, <<"one">>, 1),
ok = etrie:insert(T, <<"two">>, 2),
ok = etrie:insert(T, <<"three">>, 3),
ok = etrie:insert(T, <<"four">>, 4),
{ok,[{<<"four">>,4}, {<<"two">>,2}, {<<"three">>,3}, {<<"one">>,1}]} = etrie:to_list(T),
{ok, 4} = etrie:lookup(T, <<"four">>),
{ok,[{<<"two">>,2},{<<"three">>,3}]} = etrie:lookup_range(T, <<"t">>),
{ok, <<"three">>, 3} = etrie:longest_prefix(T, <<"three tables">>).
new()
- Creates a new HAT-Trie instance.new(BurstThreshold)
- Creates a new HAT-Trie instance with a specifiedburst_threshold
.from_list(Items)
- Creates a new HAT-Trie from a proplist.from_list(Items, BurstThreshold)
- Creates a new HAT-Trie from a proplist with a specifiedburst_threshold
.to_list(Ref)
- Returns all items from the HAT-Trie as a proplist.insert(Ref, Key, Value)
- Adds a new item to the HAT-Trie.remove(Ref, Key)
- Removes an item with a specified key from the HAT-Trie.remove_prefix(Ref, Key)
- Removes all items with a specified prefix from the HAT-Trie.clear(Ref)
- Removes all items from the HAT-Trie.is_empty(Ref)
- Indicates if the HAT-Trie is empty or not.size(Ref)
- Gets the number of items in the HAT-Trie.longest_prefix(Ref, Prefix)
- Gets the item that matches the longest prefix in the HAT-Trie.lookup(Ref, Key)
- Gets the item that matches a certain key.lookup_range(Ref, Prefix)
- Gets all items that match a certain key.is_member(Ref, Key)
- Indicates if a key is a member of the HAT-Trie.get_burst_threshold(Ref)
- Gets the current burst threshold value.set_burst_threshold(Ref, BurstThreshold)
- Adjusts the burst threshold value.
The benchmarking process is facilitated by the Erlang module (benchmarks/benchmark.erl
), which reads a designated dataset line by line. Each line in the dataset represents a key in the trie, and its associated value is the line number.
The dataset is utilized to perform various operations such as insertions, reads, and deletions, with the elapsed time measured for each operation.
The dataset used represents all the titles from the main namespace of the Wikipedia archive.
To run the benchmark:
make benchmark MODULE=etrie DATASET=/Downloads/enwiki-latest-all-titles-in-ns0 LIMIT=1000000
Where:
MODULE
: the library used to benchmark. Can be one of :etrie
,ok_btrie
orok_trie
.DATASET
: file path of the dataset.LIMIT
: Maximum number of items to load from dataset. If < 1 then load the entire dataset.
Unfortunately the okeuday trie implementation (ok_btrie
and ok_trie
) it's pretty slow and I had to limit the number of items to 2 millions.
Results are:
etrie:insert Time -> operations: 2000000 elapsed: 881.1570 ms, 440.5785 ns/key
etrie:lookup Time -> operations: 2000000 elapsed: 454.7810 ms, 227.3905 ns/key
etrie:remove Time -> operations: 2000000 elapsed: 532.2440 ms, 266.1220 ns/key
ok_btrie:insert Time -> operations: 2000000 elapsed: 60761.0590 ms, 30380.5295 ns/key
ok_btrie:lookup Time -> operations: 2000000 elapsed: 5925.6150 ms, 2962.8075 ns/key
ok_btrie:remove Time -> operations: 2000000 elapsed: 38510.2600 ms, 19255.1300 ns/key
ok_trie:insert Time -> operations: 2000000 elapsed: 61013.2730 ms, 30506.6365 ns/key
ok_trie:lookup Time -> operations: 2000000 elapsed: 2283.5370 ms, 1141.7685 ns/key
ok_trie:remove Time -> operations: 2000000 elapsed: 35662.8770 ms, 17831.4385 ns/key
I faced difficulty in finding an appropriate method to programmatically measure the memory consumption specifically utilized by the trie itself across different libraries. The challenge stems from the fact that the memory allocation by the etrie
NIFs is not exclusively performed with enif_*
methods. Consequently, Erlang lacks detailed statistics regarding the memory allocated directly through native C++ allocation methods.
To gain an understanding of memory allocation, I manually observed the memory usage of the beam.smp
process using the Activity Monitor
on OSX.
Of course the memory reported contains many other things beside the one used by the trie itself but because of the big gap it's clear that etrie
uses less memory.
Results are:
etrie: 680 MB
ok_btrie: 4.67 GB
ok_trie: 5.42 GB
Note: The test employs the line index as a value, which is an integer
and stored by etrie
as a native data type. For other values lie tuples, lists, (see Implementation Details
section) there might be additional overhead caused by term_to_binary
encoding.
In order to run the integrity tests run rebar3 eunit
from project root.