lsh

a pure python locality senstive hashing implementation

Installation

lsh is packaged with setuptools so it can be easily installed with pip like this:

$ cd lsh/
$ [sudo] pip install -e .

Usage

from lsh import LSHCache

cache = LSHCache()
    
docs = [
  "lipstick on a pig",
  "you can put lipstick on a pig",
  "you    can put lipstick on a pig but it's still a pig",
  "you can put lipstick on a pig it's still a pig",
  "i think they put some lipstick on a pig but it's still a pig",
  "putting lipstick on a pig",
  "you know you can put lipstick on a pig",
  "they were going to send us binders full of women",
  "they were going to send us binders of women",
  "a b c d e f",
  "a b c d f"]

dups = {}
for i, doc in enumerate(docs):
    dups[i] = cache.insert(doc.split(), i)
    ...

Tools

analyze_lsh will generate a model set of documents and run lsh over it. It will generate statistics showing how many documents a cache will return based on the measured similarity of the documents. e.g.,

$ python analyze_lsh.py --doc-len 1 10 --num-tokens 10 --similarity jaccard 
|   Similarity |    LSH Count |  Total Count |     % in LSH | Theoretical % |
|--------------+--------------+--------------+--------------+--------------|
|         0.00 |            0 |       301195 |       0.0000 |       0.0000 |
|         0.10 |           62 |       130972 |       0.0005 |       0.0002 |
|         0.20 |          981 |        44804 |       0.0219 |       0.0064 |
|         0.30 |         2611 |        22244 |       0.1174 |       0.0475 |
|         0.40 |         2849 |        10834 |       0.2630 |       0.1860 |
|         0.50 |         4591 |         8126 |       0.5650 |       0.4701 |
|         0.60 |         2601 |         2958 |       0.8793 |       0.8019 |
|         0.70 |          836 |          848 |       0.9858 |       0.9748 |
|         0.80 |          771 |          772 |       0.9987 |       0.9996 |
|         0.90 |            0 |            0 |          inf |       1.0000 |
|         1.00 |            0 |            0 |          inf |       1.0000 |
|==============+==============+==============+==============+==============|
|        Total |        15302 |       522753 |       0.0293 |       0.0214 |

Roadmap

add more tests
add save() and from_file() methods
rewrite with redis backend?

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
lsh		lsh
test		test
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
analyze_lsh.py		analyze_lsh.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

lsh

Installation

Usage

Tools

Roadmap

About

Releases

Packages

Languages

License

kecaps/lsh

Folders and files

Latest commit

History

Repository files navigation

lsh

Installation

Usage

Tools

Roadmap

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages