Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings. Paper code for the HNSW 200M SIFT experiment

NEWS:

Thanks to Louis Abraham (@louisabraham) hnswlib is now can be installed via pip!

Highlights:

Lightweight, header-only, no dependencies other than C++ 11.
Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
Has full support for incremental index construction. Has support for element deletions (currently, without actual freeing of the memory).
Can work with custom user defined distances (C++).
Significantly less memory footprint and faster build time compared to current nmslib's implementation.

Description of the algorithm parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

Distance	parameter	Equation
Squared L2	'l2'	d = sum((Ai-Bi)^2)
Inner product	'ip'	d = 1.0 - sum(Ai*Bi))
Cosine similarity	'cosine'	d = 1.0 - sum(AiBi) / sqrt(sum(AiAi) * sum(Bi*Bi))

Note that inner product is not an actual metric. An element can be closer to some other element than to itself.

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

Short API description

hnswlib.Index(space, dim) creates a non-initialized index an HNSW in space space with integer dimension dim.

Index methods:

init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100) initializes the index from with no elements.
- max_elements defines the maximum number of elements that can be stored in the structure(can be increased/shrunk).
- ef_construction defines a construction time/accuracy trade-off (see ALGO_PARAMS.md).
- M defines tha maximum number of outgoing connections in the graph (ALGO_PARAMS.md).
add_items(data, data_labels, num_threads = -1) - inserts the data(numpy array of vectors, shape:N*dim) into the structure.
- labels is an optional N-size numpy array of integer labels for all elements in data.
- num_threads sets the number of cpu threads to use (-1 means use default).
- Thread-safe with other add_items calls, but not with knn_query.
mark_deleted(data_label) - marks the element as deleted, so it will be ommited from search results.
resize_index(new_size) - changes the maximum capacity of the index. Not thread safe with add_items and knn_query.
set_ef(ef) - sets the query time accuracy/speed trade-off, defined by the ef parameter ( ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
knn_query(data, k = 1, num_threads = -1) make a batch query for k closests elements for each element of the
- data (shape:N*dim). Returns a numpy array of (shape:N*k).
- num_threads sets the number of cpu threads to use (-1 means use default).
- Thread-safe with other knn_query calls, but not with add_items.
load_index(path_to_index, max_elements = 0) loads the index from persistence to the uninitialized index.
- max_elements(optional) resets the maximum number of elements in the structure.
save_index(path_to_index) saves the index from persistence.
set_num_threads(num_threads) set the default number of cpu threads used during data insertion/querying.
get_items(ids) - returns a numpy array (shape:N*dim) of vectors that have integer identifiers specified in ids numpy vector (shape:N). Note that for cosine similarity it currently returns normalized vectors.
get_ids_list() - returns a list of all elements' ids.
get_max_elements() - returns the current capacity of the index
get_current_count() - returns the current number of element stored in the index

Python bindings examples

import hnswlib
import numpy as np

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

An example with updates after serialization/deserialization:

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)


print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Reiniting, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Bindings installation

You can install from sources:

apt-get install -y python-setuptools python-pip
pip3 install pybind11 numpy setuptools
cd python_bindings
python3 setup.py install

or you can install via pip: pip install hnswlib

Other implementations

Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
Faiss libary by facebook, uses own HNSW implementation for coarse quantization (python, C++): https://github.com/facebookresearch/faiss
Code for the paper "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors" (current state-of-the-art in compressed indexes, C++): https://github.com/dbaranchuk/ivf-hnsw
TOROS N2 (python, C++): https://github.com/kakao/n2
Online HNSW (C++): https://github.com/andrusha97/online-hnsw)
Go implementation: https://github.com/Bithack/go-hnsw
Python implementation (as a part of the clustering code by by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
Java implementation: https://github.com/jelmerk/hnswlib
Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
.Net implementation: https://github.com/microsoft/HNSW.Net

Contributing to the repository

Contributions are highly welcome!

Please make pull requests against the develop branch.

200M SIFT test reproduction

To download and extract the bigann dataset:

python3 download_bigann.py

To compile:

cmake .
make all

To run the test on 200M SIFT subset:

./main

The size of the bigann subset (in millions) is controlled by the variable subset_size_milllions hardcoded in sift_1b.cpp.

Updates test

To generate testing data (from root directory):

cd examples
python update_gen_data.py

To compile (from root directory):

mkdir build
cd build
cmake ..
make

To run test without updates (from build directory)

./test_updates

To run test with updates (from build directory)

./test_updates update

HNSW example demos

Visual search engine for 1M amazon products (MXNet + HNSW): website, code, demo by @ThomasDelteil

References

Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Hnswlib - fast approximate nearest neighbor search

Python bindings

Supported distances:

Short API description

Python bindings examples

Bindings installation

Other implementations

Contributing to the repository

200M SIFT test reproduction

Updates test

HNSW example demos

References

Files

README.md

Latest commit

History

README.md

File metadata and controls

Hnswlib - fast approximate nearest neighbor search

Python bindings

Supported distances:

Short API description

Python bindings examples

Bindings installation

Other implementations

Contributing to the repository

200M SIFT test reproduction

Updates test

HNSW example demos

References