This is a quick tour of the API for the C ahocorasick module. See the full API doc for more details. The pure Python module has a slightly different interface.
The module ahocorasick
contains a few constants and the main Automaton
class.
ahocorasick.unicode
--- see `Unicode and bytes`_ahocorasick.STORE_ANY
,ahocorasick.STORE_INTS
,ahocorasick.STORE_LENGTH
--- see Automaton classahocorasick.KEY_STRING
ahocorasick.KEY_SEQUENCE
--- see Automaton classahocorasick.EMPTY
,ahocorasick.TRIE
,ahocorasick.AHOCORASICK
--- see Automaton Attributesahocorasick.MATCH_EXACT_LENGTH
,ahocorasick.MATCH_AT_MOST_PREFIX
,ahocorasick.MATCH_AT_LEAST_PREFIX
--- see description of the keys method
Note: Automaton
instances are pickle-able
meaning that you can create ahead of time an eventually large automaton then save it to disk
and re-load it later to reuse it over and over as a persistent multi-string search index.
Internally, Automaton implements the __reduce__() magic method
.
Automaton([value_type], [key_type])
Create a new empty Automaton optionally passing a value_type to indicate what is the type of associated values (default to any Python object type). It can be one ofahocorasick.STORE_ANY
,ahocorasick.STORE_INTS
orahocorasick.STORE_LENGTH
. In the last case the length of the key will be stored in the automaton. The optional argument key_type can beahocorasick.KEY_STRING
orahocorasick.KEY_SEQUENCE
. In the latter case keys will be tuples of integers. The size of integer depends on the version and platform Python is running on, but for versions of Python >= 3.3, it is guaranteed to be 32-bits.
The Automaton class has the following main trie-like methods:
add_word(key, [value]) => bool
- Add a
key
string to the dict-like trie and associate this key with avalue
. remove_word(key) => bool
- Remove a
key
string from the dict-like trie. pop(key) => value
- Remove a
key
string from the dict-like trie and return the associatedvalue
. exists(key) => bool
orkey in ...
- Return True if the key is present in the trie.
match(key) => bool
- Return True if there is a prefix (or key) equal to
key
present in the trie.
A pyahocorasick Automaton trie behaves more or less like a Python dictionary and implements a subset of dict-like methods. Some of them are:
get(key[, default])
- Return the value associated with the
key
string. Similar to dict.get(). keys([prefix, [wildcard, [how]]]) => yield strings
- Return an iterator on keys.
values([prefix, [wildcard, [how]]]) => yield object
- Return an iterator on values associated with each keys.
items([prefix, [wildcard, [how]]]) => yield tuple (string, object)
- Return an iterator on tuples of (key, value).
The methods keys
, values
and items
can be called with an optional
wildcard. A wildcard character is equivalent to a question mark used in glob
patterns (?) or a dot (.) in regular expressions. You can use any character you
like as a wildcard.
Note that it is not possible to escape a wildcard to match it exactly. You need instead to select another wildcard character not present in the provided prefix. For example:
automaton.keys("hi?", "?") # would match "him", "his" automaton.keys("XX?", "X") # would match "me?", "he?" or "it?"
The Automaton class has the following main Aho-Corasick methods:
make_automaton()
- Finalize and create the Aho-Corasick automaton.
iter(string, [start, [end]])
- Perform the Aho-Corasick search procedure using the provided input
string
. Return an iterator of tuples (end_index, value) for keys found in string. iter_long(string, [start, [end]])
- Returns iterator (object of class AutomatonSearchIterLong) that searches for longest, non-overlapping matches.
Instances of this class are returned by the iter
method of an Automaton
.
This iterator can be manipulated through its set() method.
set(string, [reset]) => None
Set a new string to search eventually keeping the current Automaton state to continue searching for the next chunk of a string.
For example:
>>> it = A.iter(b"") >>> while True: ... buffer = receive(server_address, 4096) ... if not buffer: ... break ... it.set(buffer) ... for index, value in it: ... print(index, '=>', value)
When
reset
isTrue
then processing is restarted. For example this code:>>> for string in string_set: ... for index, value in A.iter(string) ... print(index, '=>', value)
does the same job as:
>>> it = A.iter(b"") >>> for string in string_set: ... it.set(it, True) ... for index, value in it: ... print(index, '=>', value)
The Automaton class has the following attributes:
kind
[readonly]- Return the state of the
Automaton
instance. store
[readonly]- Return the type of values stored in the Automaton as specified at creation.
There is support for two method of saving and loading an automaton:
- the standard
pickle
protocol, - custom
save
andload
methods.
While pickling is more convenient to use, it has quite high memory
requirements. The save
/load
method try to overcome this
problem.
Warning
Neither format of pickle nor save are safe. Although there are a few sanity checks, they are not sufficient to detect all possible input errors.
import ahocorasick
import pickle
# build automaton
A = ahocorasick.Automaton()
# ... A.add_data, A.make_automaton
# save current state
with open(path, 'wb') as f:
pickle.dump(A, f)
# load saved state
with open(path, 'rb') as f:
B = pickle.load(f)
import ahocorasick
import pickle
# build automaton
A = ahocorasick.Automaton()
# ... A.add_data, A.make_automaton
# save current state
A.save(path, pickle.dumps)
# load saved state
B = ahocorasick.load(path, pickle.loads)
Automaton method save
requires path
to the file which will store data.
If the automaton type is STORE_ANY
, i.e. values associated with words are
any python objects, then save
requires also another argument, a callable.
The callable serializes python object into bytes; in the example above we
use standard pickle dumps
function.
Module method load
also requires path
to file that has data previously
saved. Because at the moment of loading data we don't know what is the store
attribute of automaton, the second argument - a callable - is required. The
callable must convert back given bytes object into python value, that will be
stored in automaton. Similarly, standard pickle.loads
function can be passed.
The Automaton class has a few other interesting methods:
dump() => (list of nodes, list of edges, list of fail links)
- Return a three-tuple of lists describing the Automaton as a graph of
(nodes, edges, failure links).
The source repository and source package also contains the
etc/dump2dot.py
script that convertsdump()
results to a graphviz dot format for convenient visualization of the trie and Automaton data structure. get_stats() => dict
- Return a dictionary containing Automaton statistics. Note that the real size occupied by the data structure could be larger because of internal memory fragmentation that can occur in a memory manager.
__sizeof__() => int
- Return the approximate size in bytes occupied by the Automaton instance. Also available by calling sys.getsizeof(automaton instance).
>>> import ahocorasick >>> A = ahocorasick.Automaton() >>> # add some key words to trie >>> for index, word in enumerate('he her hers she'.split()): ... A.add_word(word, (index, word)) >>> # test that these key words exists in the trie all right >>> 'he' in A True >>> 'HER' in A False >>> A.get('he') (0, 'he') >>> A.get('she') (3, 'she') >>> A.get('cat', '<not exists>') '<not exists>' >>> A.get('dog') Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError >>> A.remove_word('he') True >>> A.remove_word('he') False >>> A.pop('she') (3, 'she') >>> 'she' in A False >>> # convert the trie in an Aho-Corasick automaton >>> A = ahocorasick.Automaton() >>> for index, word in enumerate('he her hers she'.split()): ... A.add_word(word, (index, word)) >>> A.make_automaton() >>> # then find all occurrences of the stored keys in a string >>> for item in A.iter('_hershe_'): ... print(item) ... (2, (0, 'he')) (3, (1, 'her')) (4, (2, 'hers')) (6, (3, 'she')) (6, (0, 'he'))
>>> import ahocorasick >>> A = ahocorasick.Automaton() >>> # add some key words to trie >>> for index, word in enumerate('cat catastropha rat rate bat'.split()): ... A.add_word(word, (index, word)) >>> # Search some prefix >>> list(A.keys('cat')) ['cat', 'catastropha'] >>> # Search with a wildcard: here '?' is used as a wildcard. You can use any character you like. >>> list(A.keys('?at', '?', ahocorasick.MATCH_EXACT_LENGTH)) ['bat', 'cat', 'rat'] >>> list(A.keys('?at?', '?', ahocorasick.MATCH_AT_MOST_PREFIX)) ['bat', 'cat', 'rat', 'rate'] >>> list(A.keys('?at?', '?', ahocorasick.MATCH_AT_LEAST_PREFIX)) ['rate']