Overview
=============

qfs_mapred is a simple streaming map reduce implementation using qfs and gearman.

-------

Qfs – Quantcast File System: a distributed, fault-tolerant file system based on GFS, written in C++. A drop-in replacement for Hadoop's HDFS.

Gearman – A job queue framework used to send generic tasks to workers. Basically a framework for forking application code out to other computers.
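
As a sketch of how a job might be handed off to a worker, here is a minimal example assuming the python-gearman client library and a job server on localhost:4730; the task name map_task and its payload are hypothetical, for illustration only:

    #!/usr/bin/env python
    # A sketch of dispatching work through gearman using the
    # python-gearman library. The server address and the task name
    # "map_task" are illustrative assumptions, not qfs_mapred's API.
    import gearman

    client = gearman.GearmanClient(['localhost:4730'])

    # The payload names the qfs input file a free mapper worker
    # should process.
    client.submit_job('map_task', '/input0')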

Motivation
----------

After getting my first RaspberryPi (rpi), I wanted to build a 'supercomputer' by creating a large cluster of rpis and having them 'crunch' data. Initially I wanted to run something like Hadoop; however, since the rpi does not have enough RAM (512mb) to run Java, I went looking for a lightweight alternative to Hadoop's HDFS and MapReduce. I found qfs as the HDFS replacement, which has a much smaller memory footprint since it is written in C++. Beyond that, I was already familiar with gearman as a distributed job queuing framework. The goal of this project was to bring qfs and gearman together to allow for a full replacement of Hadoop that would run on a cluster of low-powered computers such as the rpi.

Implementation of Map Reduce Architecture
----------

The approach is to keep the implementation as simple as possible; therefore streaming python is used for the user-provided map and reduce functions. This allows for easy debugging and portability.

Here are the 3 steps:

* Mapping:

Input to a map reduce job is a folder of input text files stored in qfs. Each file is read into a mapper worker, so there is no automatic splitting of data; simply split up the input files as needed. Ideally you would have one file per mapper worker slot, and the larger the file size the better. The output of the mapper goes back into qfs as a set of partitioned files. Since multiple mappers may be streaming data to the same output partitions, qfs has a special record appender that guarantees atomic writes in qfs.

Simplified mapper worker system command:

    cpfromqfs -k /input0 | mapper.py | mapper_to_qfs_partitions

The mapper_to_qfs_partitions tool also allows for bulk record appending by spilling data for each partition after a specified number of bytes has been produced by the mappers.
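
As an illustration of a user-provided streaming mapper, here is a minimal word-count mapper.py sketch; the tab-separated key/value record format is an assumption for the example, not a format dictated by qfs_mapred:

    #!/usr/bin/env python
    # mapper.py - a minimal word-count mapper sketch (an assumed
    # example, not part of qfs_mapred). Input arrives line-by-line on
    # stdin as piped from cpfromqfs; records are emitted as
    # tab-separated "key<TAB>value" lines.
    import sys

    for line in sys.stdin:
        for word in line.split():
            # Emit one record per word with a count of 1.
            sys.stdout.write('%s\t%d\n' % (word, 1))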

* Sorting:

The map reduce framework requires the intermediate data (mapper output) to be sorted by key for the reducers. The implementation of this step is very different from Hadoop's shuffle/sort algorithm. Each partition is read by a sorter worker from qfs into system memory, sorted, and written back to qfs. This has several trade-offs: the size of a partition cannot be larger than system memory, or paging (swapping) will occur. However, the implementation is very simple, reading & writing large chunks from qfs is efficient, and the sort's comparisons are fast because the sorting occurs in memory, not in qfs.

Simplified sorter worker system command:

    cpfromqfs -k /mapper_out | kvsorter | cptoqfs -k /mapper_out_sorted
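
As a rough sketch of the idea behind this step (this is not the actual kvsorter implementation, and the tab-separated record format is assumed):

    #!/usr/bin/env python
    # An in-memory key sort sketch illustrating what kvsorter does
    # conceptually; not the actual implementation. Records are assumed
    # to be tab-separated "key<TAB>value" lines, sorted by key only.
    import sys

    # Read the whole partition into memory -- the trade-off noted
    # above: the partition must fit in RAM or swapping will occur.
    records = sys.stdin.readlines()
    records.sort(key=lambda line: line.split('\t', 1)[0])
    sys.stdout.writelines(records)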

* Reducing:

Each reducer worker is sent the list of sorted intermediate files that are ready for reducing. The output of each reducer is written back to qfs into its own file. Once all reducers are finished, the outputs can be concatenated in qfs to produce the unsorted result of the map reduce job.

Simplified reducer worker system command:

    cpfromqfs -k /mapper_out_sorted | reducer.py | cptoqfs -k /output0
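
To pair with the mapper sketch above, a minimal streaming reducer.py, assuming the same hypothetical tab-separated word-count records and key-sorted input from the sorting step:

    #!/usr/bin/env python
    # reducer.py - a minimal word-count reducer sketch (an assumed
    # example matching the hypothetical mapper.py above). Because the
    # input is key-sorted, all records for a key arrive together.
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t', 1)
        if key != current_key:
            # Key changed: flush the finished group before starting anew.
            if current_key is not None:
                sys.stdout.write('%s\t%d\n' % (current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        sys.stdout.write('%s\t%d\n' % (current_key, total))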

Getting Started
---------------------

Requirements

...
libboost_program_options
libboost_thread
libgearman

Quick Start
----------------------
