Two quick POCs have been done using the FAISS and nmslib libraries.
According to ann-benchmarks, a benchmarking environment for approximate nearest neighbor search algorithms, hnsw (faiss) and hnsw (nmslib) are among the best performers in both speed and recall.
The experiments were run on CPU only. We used the market1501 dataset, which contains around 32,000 images, with generic embeddings generated by ResNet (512-d) and AlexNet (4096-d).
These steps assume conda and Python are already installed on your system.
The easiest way to install FAISS is via conda:
# CPU version only
conda install faiss-cpu -c pytorch
To test the installation, run:
python faiss-installation-test.py
It runs some dummy code in FAISS and should return a number without throwing any errors if the installation succeeded. If it fails, see the more detailed instructions here.
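The contents of faiss-installation-test.py are not shown here, so the following is only a sketch of what such a check might look like, not the actual script. It assumes an exact `IndexFlatL2` index over random vectors. The function name `run_faiss_smoke_test` and its parameters are hypothetical. The imports are guarded so that a missing install is reported instead of crashing.

```python
def run_faiss_smoke_test(dim=64, n_vectors=1000, seed=0):
    """Return the id of the nearest neighbor of the first vector,
    or None if FAISS (or NumPy) cannot be imported."""
    try:
        import numpy as np
        import faiss
    except ImportError:
        return None

    # FAISS expects float32 data in a 2D array of shape (n_vectors, dim).
    vectors = np.random.RandomState(seed).rand(n_vectors, dim).astype("float32")

    # Exact (brute-force) L2 index; no training step is required.
    index = faiss.IndexFlatL2(dim)
    index.add(vectors)

    # Query with the first stored vector and ask for its single nearest neighbor.
    distances, ids = index.search(vectors[:1], 1)
    return int(ids[0][0])


if __name__ == "__main__":
    result = run_faiss_smoke_test()
    if result is None:
        print("FAISS is not importable; check the installation.")
    else:
        print(result)
```

Because the query vector is itself stored in the index, a working install should report that vector as its own nearest neighbor.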
Load the dataset onto the VM. I copied it from my local computer to the VM using the command below:
# Run this on your local machine to copy the data zip to the VM
scp 'New_Archive.zip' <username>@<your-vm-ip>:/<desired-vm-file-path>/
Change to the project directory and unzip the archive into the "data" folder. The dataset contains images and their corresponding embeddings.
unzip New_Archive.zip -d ./data
Exit and ssh back into your VM using:
ssh -L 8888:localhost:8888 <vm-username>@<your-vm-ip>
The -L flag in the line above forwards port 8888 on the VM to port 8888 on your computer, so you can access Jupyter using localhost in your computer's browser.
Run Jupyter Notebook in your VM using:
jupyter notebook
Then, in your computer's browser, open the localhost link that was printed when you ran the jupyter notebook command. You can now try out the notebooks.
The easiest way to install nmslib is via pip. Make sure you have Python 3.x. Installation instructions for the nmslib Python library are here.
First, install the python dev tools:
sudo apt-get install python3-dev
Then do a pip install:
pip install nmslib
Exit and ssh back into your VM using:
ssh -L 8888:localhost:8888 <vm-username>@<your-vm-ip>
The -L flag in the line above forwards port 8888 on the VM to port 8888 on your computer, so you can access Jupyter using localhost in your computer's browser.
Run Jupyter Notebook in your VM using:
jupyter notebook
You can now run the notebooks that use nmslib (like market1501-nmslib.ipynb).
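As a quick check outside the notebooks, the sketch below builds a small HNSW index with nmslib and queries it. It is not taken from market1501-nmslib.ipynb. The function name `build_and_query_hnsw` and the parameter values (`M`, `efConstruction`, `efSearch`) are illustrative assumptions, not tuned settings.

```python
def build_and_query_hnsw(vectors, query, k=5):
    """Build an HNSW index over `vectors` and return (ids, distances)
    for the k approximate nearest neighbors of `query`.
    Returns None if nmslib (or NumPy) cannot be imported."""
    try:
        import numpy as np
        import nmslib
    except ImportError:
        return None

    data = np.asarray(vectors, dtype=np.float32)

    # HNSW graph index over dense vectors with Euclidean (L2) distance.
    index = nmslib.init(method="hnsw", space="l2")
    index.addDataPointBatch(data)

    # Illustrative build-time parameters (assumed, not tuned).
    index.createIndex({"M": 16, "efConstruction": 100}, print_progress=False)

    # Larger efSearch generally trades query speed for recall.
    index.setQueryTimeParams({"efSearch": 50})

    ids, distances = index.knnQuery(np.asarray(query, dtype=np.float32), k=k)
    return [int(i) for i in ids], [float(d) for d in distances]
```

A call such as `build_and_query_hnsw(embeddings, embeddings[0], k=10)` would return the ids and distances of the approximate neighbors of the first embedding, where `embeddings` stands for whatever array of vectors has been loaded.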
Here are some observations made while doing the experiments:
- Loading 4096-dimensional embeddings from txt files is not feasible: the load ran for over 1.5 hours without finishing. Another storage format is needed (e.g. h5).
- Indexing strategy and parameters need careful consideration, and benchmarking is needed to determine the best ones. There are many tradeoffs between runtime and accuracy, and many index types and parameters that can be used for optimization.
- Benchmarking is also needed to determine the best distance metric. For market1501, L2 worked best, but what about other datasets?
- Recall was not calculated as part of the POC. Ground truth data needs to be computed first.
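For the last point, here is a minimal sketch of how recall could be computed once ground truth exists. It assumes ground truth means the exact k nearest neighbors under squared L2 distance, found by brute force. The helper names (`squared_l2`, `exact_neighbors`, `recall_at_k`) are hypothetical, and this was not part of the POC. Pure Python is used only for clarity. On the full dataset an exact index or vectorized code would be far more practical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_neighbors(database, query, k):
    """Brute-force ground truth: ids of the k closest database vectors,
    ordered from nearest to farthest."""
    return heapq.nsmallest(
        k, range(len(database)), key=lambda i: squared_l2(database[i], query)
    )


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate search returned."""
    truth = set(exact_ids)
    if not truth:
        return 0.0
    return len(truth.intersection(approx_ids)) / len(truth)


if __name__ == "__main__":
    database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
    query = [0.1, 0.0]
    truth = exact_neighbors(database, query, k=2)
    # Pretend an approximate index returned ids 0 and 2 for this query.
    print(truth, recall_at_k([0, 2], truth))
```

Averaging `recall_at_k` over a set of held-out queries gives a single recall figure that can be compared across index types, parameters, and distance metrics.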