Merge pull request piskvorky#22 from RaRe-Technologies/pypi_readme
Fix 1.0.1rc1 version with documentation
isamaru authored Oct 17, 2017
2 parents 8db1227 + 5c4d706 commit 7350920
Showing 5 changed files with 48 additions and 38 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,7 @@
Changes
===========
-## 1.0.0, 2017-10-17
+
+## 1.0.1, 2017-10-17

:star2: Release version:

66 changes: 33 additions & 33 deletions README.md
@@ -47,33 +47,33 @@ In particular, Bounter implements three different algorithms under the hood, dep

1. **[Cardinality estimation](https://en.wikipedia.org/wiki/Count-distinct_problem): "How many unique items are there?"**

```python
from bounter import bounter

counts = bounter(need_counts=False)
counts.update(['a', 'b', 'c', 'a', 'b'])

print(counts.cardinality()) # cardinality estimation
3
print(counts.total()) # efficiently accumulates counts across all items
5
```

This is the simplest use case and needs the least amount of memory, by using the [HyperLogLog algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (built on top of Joshua Andersen's [HLL](https://github.com/ascv/HyperLogLog) code).
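
To make the idea behind HyperLogLog concrete, here is a minimal pure-Python sketch. It is not Bounter's implementation (which is C code built on the HLL library mentioned above); the class name `TinyHLL`, the `precision` parameter, and the use of SHA-1 as the hash are assumptions made purely for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog; NOT bounter's C implementation (hypothetical helper)."""

    def __init__(self, precision=10):
        self.precision = precision          # number of hash bits used as register index
        self.m = 1 << precision             # number of registers (fixed memory)
        self.registers = [0] * self.m

    def update(self, items):
        tail_bits = 64 - self.precision
        for item in items:
            digest = hashlib.sha1(item.encode('utf-8')).digest()
            h = int.from_bytes(digest[:8], 'big')         # 64-bit hash value
            index = h >> tail_bits                        # top bits select a register
            tail = h & ((1 << tail_bits) - 1)             # remaining bits
            rank = tail_bits - tail.bit_length() + 1      # position of the first 1-bit
            if rank > self.registers[index]:
                self.registers[index] = rank

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)             # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


hll = TinyHLL()
hll.update(['a', 'b', 'c', 'a', 'b'])
small_estimate = hll.cardinality()
print(round(small_estimate))    # close to 3; an estimate, not an exact count

big = TinyHLL()
big.update(str(i) for i in range(20000))
big_estimate = big.cardinality()
print(round(big_estimate))      # typically within a few percent of 20000
```

The point of the sketch is that memory stays at 1024 small registers whether 3 or 20,000 distinct items are seen; only the accuracy of the estimate depends on the number of registers.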

2. **Item frequencies: "How many times did this item appear?"**

```python
from bounter import bounter

counts = bounter(need_iteration=False, size_mb=200)
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.total(), counts.cardinality()) # total and cardinality still work
(5L, 3L)

print(counts['a']) # supports asking for counts of individual items
2
```

This uses the [Count-min Sketch algorithm](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch) to estimate item counts efficiently, in a **fixed amount of memory**. See the [API docs](https://github.com/RaRe-Technologies/bounter/blob/master/bounter/bounter.py) for full details and parameters.
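
As a rough illustration of how a Count-min Sketch answers frequency queries in fixed memory, the following pure-Python sketch implements the plain textbook variant. It is an assumption-laden toy, not Bounter's C implementation: the class name `TinyCountMin`, the `width` and `depth` parameters, and the SHA-256 based row hashing are all hypothetical choices for this example.

```python
import hashlib


class TinyCountMin:
    """Textbook Count-min Sketch; NOT bounter's implementation (hypothetical helper)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # depth x width table of counters; its size never changes
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by prefixing the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(('%d:%s' % (row, item)).encode('utf-8')).digest()
            yield row, int.from_bytes(digest[:8], 'big') % self.width

    def update(self, items):
        for item in items:
            for row, col in self._cells(item):
                self.rows[row][col] += 1

    def __getitem__(self, item):
        # Collisions can only inflate a cell, so the minimum is the best estimate
        # and never falls below the true count.
        return min(self.rows[row][col] for row, col in self._cells(item))


sketch = TinyCountMin()
sketch.update(['a', 'b', 'c', 'a', 'b'])
count_a = sketch['a']
count_unseen = sketch['zzz']
print(count_a, count_unseen)
```

Because hash collisions can only add to a counter, a query may overestimate but never underestimate; a wider table lowers the chance of collisions at the cost of more memory.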

@@ -87,21 +87,21 @@ Such memory vs. accuracy tradeoffs are sometimes desirable in NLP, where being a

3. **Full item iteration: "What are the items and their frequencies?"**

```python
from bounter import bounter

counts = bounter(size_mb=200) # default version, unless you specify need_items or need_counts
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.total(), counts.cardinality()) # total and cardinality still work
(5L, 3)
print(counts['a']) # individual item frequency still works
2

print(list(counts)) # iterator returns keys, just like Counter
[u'b', u'a', u'c']
print(list(counts.iteritems())) # supports iterating over key-count pairs, etc.
[(u'b', 2L), (u'a', 2L), (u'c', 1L)]
```

Stores the keys (strings) themselves in addition to the total cardinality and individual item frequency (8 bytes). Uses the most memory, but supports the widest range of functionality.
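
For comparison, the same queries can be expressed with the standard library's `collections.Counter`, which the comment above uses as its reference point. This is only an exact, unbounded-memory baseline under Python 3, not a statement about Bounter's own output; the variable names are illustrative.

```python
from collections import Counter

counts = Counter()
counts.update(['a', 'b', 'c', 'a', 'b'])

total = sum(counts.values())      # analogue of counts.total()
cardinality = len(counts)         # analogue of counts.cardinality()
print(total, cardinality)         # 5 3
print(counts['a'])                # 2
print(sorted(counts.items()))     # [('a', 2), ('b', 2), ('c', 1)]
```

A `Counter` keeps every key and grows with the data, which is the behaviour a size-bounded counter trades away in exchange for a fixed memory footprint.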

2 changes: 1 addition & 1 deletion bounter/__init__.py
@@ -7,7 +7,7 @@
# This code is distributed under the terms and conditions
# from the MIT License (MIT).

-__version__ = '1.0.0'
+__version__ = '1.0.1rc1'

from .count_min_sketch import CountMinSketch
from bounter_htc import HT_Basic as HashTable
13 changes: 11 additions & 2 deletions setup.py
@@ -7,18 +7,27 @@
# This code is distributed under the terms and conditions
# from the MIT License (MIT).

-import sys
+import sys, os, io

if sys.version_info < (2, 7):
    raise ImportError("bounter requires python >= 2.7")

# TODO add ez_setup?
from setuptools import setup, find_packages, Extension

+
+def read(fname):
+    name = os.path.join(os.path.dirname(__file__), fname)
+    if not os.path.isfile(name):
+        return ''
+    with io.open(name, encoding='utf-8') as readfile:
+        return readfile.read()
+
setup(
    name='bounter',
-    version='1.0.0',
+    version='1.0.1rc1',
    description='Counter for large datasets',
+    long_description=read('README.rst'),

    headers=['cbounter/hll.h', 'cbounter/murmur3.h'],
    ext_modules=[
2 changes: 1 addition & 1 deletion upload.sh
@@ -1,3 +1,3 @@
#!/bin/bash
-pandoc --from=markdown --to=rst --output=README README.md
+pandoc --from=markdown --to=rst --output=README.rst README.md
python setup.py sdist upload
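
The `read` helper in `setup.py` and the renamed pandoc output in `upload.sh` fit together: `upload.sh` generates `README.rst`, and `setup.py` falls back to an empty `long_description` when that file has not been generated. A small standalone sketch of that fallback follows; unlike the helper in the diff, this version takes a full path rather than resolving against `__file__`, which is an assumption made so it can run outside a package checkout.

```python
import io
import os
import tempfile


def read(path):
    """Return the file's text, or '' if it does not exist (mirrors the setup.py fallback)."""
    if not os.path.isfile(path):
        return ''
    with io.open(path, encoding='utf-8') as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, 'README.rst')

    missing = read(readme)              # file not generated yet
    with io.open(readme, 'w', encoding='utf-8') as handle:
        handle.write(u'Bounter\n=======\n')
    present = read(readme)              # file exists after the conversion step

print(repr(missing))
print(repr(present))
```

Returning an empty string keeps `python setup.py` usable from a fresh checkout where the conversion step has not been run, at the cost of silently publishing no long description if the step is forgotten.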
