Vector of Locally Aggregated Descriptors (VLAD)


Description

This repository is an implementation of VLAD, which was originally formulated by Hervé Jégou in [1]. The implementation is part of my bachelor's thesis, titled "Computer Vision and Machine Learning for marker-free product identification".

VLAD is an algorithm that aggregates local image descriptors into a compact, global representation. It is derived from the Bag of Features approach [2] and related to the Fisher vector [3].
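
For intuition, the core aggregation step can be sketched in a few lines of NumPy. This is an illustrative sketch of the technique from [1], not this repository's exact code; all names are made up:

import numpy as np

def vlad_sketch(descriptors, centroids):
    # descriptors: (m x d) local descriptors of a single image
    # centroids:   (k x d) visual vocabulary, e.g. learned with k-means
    k, d = centroids.shape
    # hard-assign every descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        if np.any(assignments == i):
            # accumulate the residuals between descriptors and their visual word
            v[i] = (descriptors[assignments == i] - centroids[i]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square root (SSR) [1]
    return v / np.linalg.norm(v)         # global L2 normalization [1]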

This repository is a work in progress and should not be considered production-ready. It takes some inspiration from Jorjasso's implementation and tries to improve on it. Improved versions of the original formulation are also implemented; they are taken from [1, 4, 5], and references are inserted in the code.

Dependencies

  • NumPy
  • scikit-learn
  • progressbar2
  • OpenCV (for the examples)

Install Instructions

TODO

Usage

The API is modeled after the wonderful Scikit-Learn API, which uses the basic notions of fit/predict/transform.

To import VLAD into the current file, just write:

from vlad import VLAD

On initialization, the number of visual words (k) and the norming scheme are given. Norming is a crucial difference between the implementations described in [1, 4, 5], with [5] containing the preferable one. To instantiate a VLAD object, write:

vlad = VLAD(k=16, norming="RN")  # Defaults are k=256 and norming="original"
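
For orientation, the norming schemes from the cited papers differ roughly as sketched below. This is an illustration of the published formulations, not necessarily how the norming parameter is realized internally:

import numpy as np

def intra_normalize(v, k, d):
    # Intra-normalization [4]: L2-normalize each visual word's block, then the whole vector
    v = v.reshape(k, d)
    v = v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)
    v = v.ravel()
    return v / np.linalg.norm(v)

def aggregate_rn(residuals):
    # Residual normalization (RN) [5]: L2-normalize each residual before summing it
    norms = np.maximum(np.linalg.norm(residuals, axis=1, keepdims=True), 1e-12)
    return (residuals / norms).sum(axis=0)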

After instantiating the object, you can fit the visual vocabulary with:

vlad.fit(X)

The fit-function also returns the instance (again, sklearn-style), so the following two are equivalent:

vlad = VLAD(k=16, norming="RN")
vlad.fit(X)
# ...
vlad = VLAD(k=16, norming="RN").fit(X)

X is a tensor of image descriptors (m x d x n), where m is the number of descriptors per image, d is the number of dimensions per descriptor and n is the number of images. It is best to use descriptors that live in Euclidean space (such as SIFT or RootSIFT [6]) rather than in Hamming space, as the k-means clustering won't work properly with binary descriptors.
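
As an example, such a tensor could be assembled from RootSIFT descriptors with OpenCV. This is a sketch with hypothetical image paths; it assumes every image yields at least m keypoints, otherwise padding or skipping images would be needed:

import cv2
import numpy as np

def root_sift(image_path, m=256):
    # Extract SIFT descriptors, then convert them to RootSIFT [6]:
    # L1-normalize each descriptor and take the element-wise square root.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create(nfeatures=m)
    _, desc = sift.detectAndCompute(img, None)
    desc = desc[:m]  # keep exactly m descriptors (assumes at least m keypoints)
    desc /= (np.abs(desc).sum(axis=1, keepdims=True) + 1e-12)
    return np.sqrt(desc)  # (m x d), d = 128 for SIFT

paths = ["img0.jpg", "img1.jpg", "img2.jpg"]  # hypothetical image paths
X = np.stack([root_sift(p) for p in paths], axis=2)  # (m x d x n)
vlad = VLAD(k=16, norming="RN").fit(X)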

Whenever a visual dictionary is fitted, it is saved to disk and can be loaded manually to bypass training.

To query with an image, one can write:

vlad.predict(imdesc)  # imdesc is a (m x d) descriptor-matrix

to get the index of the fitted image with maximum similarity. Alternatively,

vlad.predict_proba(imdesc)

can be used to obtain a NumPy array with all similarity scores.
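
Putting it together (reusing the hypothetical root_sift helper sketched above; "query.jpg" is a placeholder path):

imdesc = root_sift("query.jpg")      # (m x d) descriptor matrix of the query image
best_index = vlad.predict(imdesc)    # index of the most similar fitted image
scores = vlad.predict_proba(imdesc)  # one similarity score per fitted image
print(best_index, scores.argmax())   # both should point to the same image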

If you want to work with the VLAD descriptors outside of the class, the transform and fit_transform functions can be used:

vlads = vlad.transform(descriptor_tensor)  # Call on fitted model

vlads = vlad.fit_transform(descriptor_tensor)  # Can be called on a non-fitted model
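
The resulting VLAD descriptors can then be fed into any off-the-shelf nearest-neighbour search, e.g. with scikit-learn. This is a sketch that assumes transform/fit_transform return one VLAD vector per image:

from sklearn.neighbors import NearestNeighbors

vlads = vlad.fit_transform(descriptor_tensor)    # assumed shape: (n x k*d)
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(vlads)
distances, indices = nn.kneighbors(vlads[:1])    # 5 nearest neighbours of the first image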

Documentation

Documentation can be found at [...] TODO

Roadmap

  • Original formulation (SSR, L2) [1]: Done
  • Use RootSIFT descriptors [6]: Done
  • Try with more descriptors: TODO
  • Try with dense descriptors: TODO
  • Intra-normalization [4]: Done
  • Residual normalization (RN) [5]: Done
  • Local Coordinate System (LCS) [5]: Done
  • Dimensionality reduction [7, 8]: TODO
  • Quantization [9]: Done
  • Generalization using multiple vocabularies [7]: TODO
  • Make documentation: TODO
  • Include tests: TODO
  • Include install instructions: TODO
  • Include usage examples: Done
  • Provide example notebooks: Done

References

[1]: Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010, June). Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3304-3311). IEEE.

[2]: Sivic, J., & Zisserman, A. (2003, October). Video Google: A text retrieval approach to object matching in videos. In null (p. 1470). IEEE.

[3]: Perronnin, F., Sánchez, J., & Mensink, T. (2010, September). Improving the fisher kernel for large-scale image classification. In European conference on computer vision (pp. 143-156). Springer, Berlin, Heidelberg.

[4]: Arandjelović, R., & Zisserman, A. (2013). All about VLAD. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1578-1585).

[5]: Delhumeau, J., Gosselin, P. H., Jégou, H., & Pérez, P. (2013, October). Revisiting the VLAD image representation. In Proceedings of the 21st ACM international conference on Multimedia (pp. 653-656).

[6]: Arandjelović, R., & Zisserman, A. (2012, June). Three things everyone should know to improve object retrieval. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 2911-2918). IEEE.

[7]: Jégou, H., & Chum, O. (2012, October). Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In European conference on computer vision (pp. 774-787). Springer, Berlin, Heidelberg.

[8]: Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2011). Aggregating local image descriptors into compact codes. IEEE transactions on pattern analysis and machine intelligence, 34(9), 1704-1716.

[9]: Jégou, H., Douze, M., & Schmid, C. (2010). Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1), 117-128.
