A simple framework for MapReduce processing on a single machine.
From command line:
cd sample
cat corpus.txt | python map.py | python shuffle.py | python reduce.py
def mapper(stream, map_function, emit=emit_console)
Argument | Description |
---|---|
stream | Iterable input stream. |
map_function | Function called to map each input stream entry. |
emit | Function called to output map_function result. Default emit_console outputs to console. |
def map_function(line, emit)
Argument | Description |
---|---|
line | Line to process. |
emit | Function called to output result. |
def emit(key, value=None)
Argument | Description |
---|---|
key | Key. |
value | Value associated to key. |
When emitted, key and value must be separated by common.COLUMN_SEPARATOR
. common.format_key_value
can be used to do so.
def reducer(stream, reduce_function, emit=emit_console)
Argument | Description |
---|---|
stream | Input stream. Must be iterable. |
reduce_function | Function called to reduce each input stream entry. |
emit | Function called to output reduce_function result. |
def reduce(key, values)
Argument | Description |
---|---|
key | Key. |
value | Array of values associated to key. |