GitHub - gautham-gn/hadoop-streaming-python: Hadoop Streaming for Dataset Analysis

Hadoop Streaming using Python

The task is to run a MapReduce program written in Python on a Hadoop cluster using Hadoop Streaming. Follow the steps below.

Log in to the Hadoop cluster using your credentials.
Clone the repository:
git clone https://github.uc.edu/gondinm/hadoop-streaming.git
The above command creates a hadoop-streaming directory in your local file system.
Change into that directory:
cd hadoop-streaming
You will see the mapper and reducer scripts (mapper.py and reducer.py) inside the directory.
Now run the following command to run the MapReduce job on the New York City traffic accidents data (the output directory must not already exist):
hadoop jar /usr/hdp/2.6.3.0-235/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.3.0-235.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /data/nyc/nyc-traffic.csv -output /user/gondinm/pyOut/
The log from running the above command and the output file are included in this repository as RunLog.txt and OutputFile.
When the job finishes, it creates a pyOut directory in HDFS. Copy it to a local directory using:
hadoop fs -get /user/gondinm/pyOut /home/gondinm/pyOut
Go into the local pyOut directory and open the part-00000 file to view the output of the job.
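
For readers who have not seen a streaming job before, here is a minimal sketch of the kind of logic a mapper/reducer pair typically contains. It is not the mapper.py or reducer.py from this repository. It assumes the job counts rows per value of a single CSV column, and the column index (KEY_COLUMN) and the sample rows are made up for illustration. In a real streaming job the mapper and reducer are separate scripts that read sys.stdin and print to stdout; here they are written as generator functions so that the map, sort, and reduce phases can be tried in one process without a cluster.

```python
import csv
from itertools import groupby

KEY_COLUMN = 0  # hypothetical: index of the CSV column to count by


def mapper(lines, key_column=KEY_COLUMN):
    """Emit one 'key<TAB>1' record per CSV row that has a non-empty key."""
    for row in csv.reader(lines):
        if len(row) > key_column and row[key_column].strip():
            yield f"{row[key_column].strip()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key. Input must already be sorted by key,
    which is what Hadoop's shuffle phase provides to a streaming reducer."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Mimic map -> sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Made-up sample rows, not taken from the real dataset.
    sample = ["BROOKLYN,2017-01-01", "QUEENS,2017-01-01", "BROOKLYN,2017-01-02"]
    for line in run_locally(sample):
        print(line)
```

The sketch does not skip a CSV header row. A real mapper for a file with a header would need to filter it out.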
