Skip to content

Commit

Permalink
extra indent
Browse files Browse the repository at this point in the history
  • Loading branch information
isamaru committed Oct 18, 2017
1 parent f6dc86b commit 9e755ba
Show file tree
Hide file tree
Showing 2 changed files with 267 additions and 11 deletions.
22 changes: 11 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -132,17 +132,17 @@ To test the accuracy of Bounter, we automatically extracted [collocations](https

We compared the set of collocations extracted from Counter (exact counts, needs lots of memory) vs Bounter (approximate counts, bounded memory) and present the precision and recall here:

| Algorithm | Time to build | Memory | Precision | Recall | F1 score
|-----------------------------------|--------------:|--------:|----------:|-------:|---------:|
| `Counter` (built-in) | 32m 26s | 31 GB | 100% | 100% | 100% |
| `bounter(size_mb=128, need_iteration=False, log_counting=8)` | 19m 53s | **128 MB** | 95.02% | 97.10% | 96.04% |
| `bounter(size_mb=1024)` | 17m 54s | 1 GB | 100% | 99.27% | 99.64% |
| `bounter(size_mb=1024, need_iteration=False)` | 19m 58s | 1 GB | 99.64% | 100% | 99.82% |
| `bounter(size_mb=1024, need_iteration=False, log_counting=1024)` | 20m 05s | 1 GB | **100%** | **100%** | **100%** |
| `bounter(size_mb=1024, need_iteration=False, log_counting=8)` | 19m 59s | 1 GB | 97.45% | 97.45% | 97.45% |
| `bounter(size_mb=4096)` | **16m 21s** | 4 GB | 100% | 100% | 100% |
| `bounter(size_mb=4096, need_iteration=False)` | 20m 14s | 4 GB| 100% | 100% | 100% |
| `bounter(size_mb=4096, need_iteration=False, log_counting=1024)` | 20m 14s | 4 GB | 100% | 99.64% | 99.82% |
| Algorithm | Time to build | Memory | Precision | Recall | F1 score |
|------------------------------------------------------------------|--------------:|-----------:|----------:|---------:|---------:|
| `Counter` (built-in) | 32m 26s | 31 GB | 100% | 100% | 100% |
| `bounter(size_mb=128, need_iteration=False, log_counting=8)` | 19m 53s | **128 MB** | 95.02% | 97.10% | 96.04% |
| `bounter(size_mb=1024)` | 17m 54s | 1 GB | 100% | 99.27% | 99.64% |
| `bounter(size_mb=1024, need_iteration=False)` | 19m 58s | 1 GB | 99.64% | 100% | 99.82% |
| `bounter(size_mb=1024, need_iteration=False, log_counting=1024)` | 20m 05s | 1 GB | **100%** | **100%** | **100%** |
| `bounter(size_mb=1024, need_iteration=False, log_counting=8)` | 19m 59s | 1 GB | 97.45% | 97.45% | 97.45% |
| `bounter(size_mb=4096)` | **16m 21s** | 4 GB | 100% | 100% | 100% |
| `bounter(size_mb=4096, need_iteration=False)` | 20m 14s | 4 GB | 100% | 100% | 100% |
| `bounter(size_mb=4096, need_iteration=False, log_counting=1024)` | 20m 14s | 4 GB | 100% | 99.64% | 99.82% |

Bounter achieves a perfect F1 score of 100% at 31x less memory (1GB vs 31GB), compared to a built-in `Counter` or `dict`. It is also 61% faster.

Expand Down
256 changes: 256 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
Bounter -- Counter for large datasets
=====================================

|Build Status|\ |GitHub release|\ |Mailing List|\ |Gitter|\ |Follow|

Bounter is a Python library, written in C, for extremely fast
probabilistic counting of item frequencies in massive datasets, using
only a small fixed memory footprint.

Why Bounter?
------------

Bounter lets you count how many times an item appears, similar to
Python's built-in ``dict`` or ``Counter``:

.. code:: python
from bounter import bounter
counts = bounter(size_mb=1024) # use at most 1 GB of RAM
counts.update([u'a', 'few', u'words', u'a', u'few', u'times']) # count item frequencies
print(counts[u'few']) # query the counts
2
However, unlike ``dict`` or ``Counter``, Bounter can process huge
collections where the items would not even fit in RAM. This commonly
happens in Machine Learning and NLP, with tasks like **dictionary
building** or **collocation detection** that need to estimate counts of
billions of items (token ngrams) for their statistical scoring and
subsequent filtering.

Bounter implements approximative algorithms using optimized low-level C
structures, to avoid the overhead of Python objects. It lets you specify
the maximum amount of RAM you want to use. In the Wikipedia example
below, Bounter uses 31x less memory compared to ``Counter``.

Bounter is also marginally faster than the built-in ``dict`` and
``Counter``, so wherever you can represent your **items as strings**
(both byte-strings and unicode are fine, and Bounter works in both
Python2 and Python3), there's no reason not to use Bounter instead.

Installation
------------

Bounter has no dependencies beyond Python >= 2.7 or Python >= 3.3 and a
C compiler:

.. code:: bash
pip install bounter # install from PyPI
Or, if you prefer to install from the `source
tar.gz <https://pypi.python.org/pypi/bounter>`__:

.. code:: bash
python setup.py test # run unit tests
python setup.py install
How does it work?
-----------------

No magic, just some clever use of approximative algorithms and solid
engineering.

In particular, Bounter implements three different algorithms under the
hood, depending on what type of "counting" you need:

1. **`Cardinality
estimation <https://en.wikipedia.org/wiki/Count-distinct_problem>`__:
"How many unique items are there?"**

.. code:: python
from bounter import bounter
counts = bounter(need_counts=False)
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.cardinality()) # cardinality estimation
3
print(counts.total()) # efficiently accumulates counts across all items
5
This is the simplest use case and needs the least amount of memory, by
using the `HyperLogLog
algorithm <http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf>`__
(built on top of Joshua Andersen's
`HLL <https://github.com/ascv/HyperLogLog>`__ code).

2. **Item frequencies: "How many times did this item appear?"**

.. code:: python
from bounter import bounter
counts = bounter(need_iteration=False, size_mb=200)
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.total(), counts.cardinality()) # total and cardinality still work
(5L, 3L)
print(counts['a']) # supports asking for counts of individual items
2
This uses the `Count-min Sketch
algorithm <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__ to
estimate item counts efficiently, in a **fixed amount of memory**. See
the `API
docs <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__
for full details and parameters.

As a further optimization, Count-min Sketch optionally support a
`logarithmic probabilistic
counter <https://en.wikipedia.org/wiki/Approximate_counting_algorithm>`__:

- ``bounter(need_iteration=False)``: default option. Exact counter, no
probabilistic counting. Occupies 4 bytes (max value 2^32) per bucket.
- ``bounter(need_iteration=False, log_counting=1024)``: an integer
counter that occupies 2 bytes. Values up to 2048 are exact; larger
values are off by +/- 2%. The maximum representable value is around
2^71.
- ``bounter(need_iteration=False, log_counting=8)``: a more aggressive
probabilistic counter that fits into just 1 byte. Values up to 8 are
exact and larger values can be off by +/- 30%. The maximum
representable value is about 2^33.

Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where
being able to handle very large collections is more important than
whether an event occurs exactly 55,482x or 55,519x.

3. **Full item iteration: "What are the items and their frequencies?"**

.. code:: python
from bounter import bounter
counts = bounter(size_mb=200) # default version, unless you specify need_items or need_counts
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.total(), counts.cardinality()) # total and cardinality still work
(5L, 3)
print(counts['a']) # individual item frequency still works
2
print(list(counts)) # iterator returns keys, just like Counter
[u'b', u'a', u'c']
print(list(counts.iteritems())) # supports iterating over key-count pairs, etc.
[(u'b', 2L), (u'a', 2L), (u'c', 1L)]
Stores the keys (strings) themselves in addition to the total
cardinality and individual item frequency (8 bytes). Uses the most
memory, but supports the widest range of functionality.

This option uses a custom C hash table underneath, with optimized string
storage. It will remove its low-count objects when nearing the maximum
alotted memory, instead of expanding the table.

--------------

For more details, see the `API
docstrings <https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py>`__.

Example on the English Wikipedia
--------------------------------

Let's count the frequencies of all bigrams in the English Wikipedia
corpus:

.. code:: python
with smart_open('wikipedia_tokens.txt.gz') as wiki:
for line in wiki:
words = line.decode().split()
bigrams = zip(words, words[1:])
counter.update(u' '.join(pair) for pair in bigrams)
print(counter[u'czech republic'])
42099
The Wikipedia dataset contained 7,661,318 distinct words across
1,860,927,726 total words, and 179,413,989 distinct bigrams across
1,857,420,106 total bigrams. Storing them in a naive built-in ``dict``
would consume over 31 GB RAM.

To test the accuracy of Bounter, we automatically extracted
`collocations <https://en.wikipedia.org/wiki/Collocation>`__ (common
multi-word expressions, such as "New York", "network license", "Supreme
Court" or "elementary school") from these bigram counts.

We compared the set of collocations extracted from Counter (exact
counts, needs lots of memory) vs Bounter (approximate counts, bounded
memory) and present the precision and recall here:

+----------------------------------------------+----------+---------+-----------+----------+----------+
| Algorithm | Time to | Memory | Precision | Recall | F1 score |
| | build | | | | |
+==============================================+==========+=========+===========+==========+==========+
| ``Counter`` (built-in) | 32m 26s | 31 GB | 100% | 100% | 100% |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=128, need_iteration=False, | 19m 53s | **128 | 95.02% | 97.10% | 96.04% |
| log_counting=8)`` | | MB** | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=1024)`` | 17m 54s | 1 GB | 100% | 99.27% | 99.64% |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=1024, | 19m 58s | 1 GB | 99.64% | 100% | 99.82% |
| need_iteration=False)`` | | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=1024, | 20m 05s | 1 GB | **100%** | **100%** | **100%** |
| need_iteration=False, log_counting=1024)`` | | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=1024, | 19m 59s | 1 GB | 97.45% | 97.45% | 97.45% |
| need_iteration=False, log_counting=8)`` | | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=4096)`` | **16m | 4 GB | 100% | 100% | 100% |
| | 21s** | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=4096, | 20m 14s | 4 GB | 100% | 100% | 100% |
| need_iteration=False)`` | | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+
| ``bounter(size_mb=4096, | 20m 14s | 4 GB | 100% | 99.64% | 99.82% |
| need_iteration=False, log_counting=1024)`` | | | | | |
+----------------------------------------------+----------+---------+-----------+----------+----------+

Bounter achieves a perfect F1 score of 100% at 31x less memory (1GB vs
31GB), compared to a built-in ``Counter`` or ``dict``. It is also 61%
faster.

Even with just 128 MB (250x less memory), its F1 score is still 96.04%.

Support
=======

Use `Github
issues <https://github.com/RaRe-Technologies/bounter/issues>`__ to
report bugs, and our `mailing
list <https://groups.google.com/forum/#!forum/gensim>`__ for general
discussion and feature ideas.

--------------

``Bounter`` is open source software released under the `MIT
license <https://github.com/rare-technologies/bounter/blob/master/LICENSE>`__.

Copyright (c) 2017 `RaRe
Technologies <https://rare-technologies.com/>`__

.. |Build Status| image:: https://travis-ci.org/RaRe-Technologies/bounter.svg?branch=master
:target: https://travis-ci.org/RaRe-Technologies/bounter
.. |GitHub release| image:: https://img.shields.io/github/release/rare-technologies/bounter.svg?maxAge=3600
:target: https://github.com/RaRe-Technologies/bounter/releases
.. |Mailing List| image:: https://img.shields.io/badge/-Mailing%20List-lightgrey.svg
:target: https://groups.google.com/forum/#!forum/gensim
.. |Gitter| image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg
:target: https://gitter.im/RaRe-Technologies/gensim
.. |Follow| image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow
:target: https://twitter.com/gensim_py

0 comments on commit 9e755ba

Please sign in to comment.