
# CS553 - Cloud Computing - Illinois Institute of Technology

## TeraSort on Hadoop/Spark

### Hadoop Steps

1. Upload your pem file to the AWS instance, then run the following to give it the right permissions and load it into the ssh agent:

    eval "$(ssh-agent -s)"
    chmod 600 Hadoop.pem
    ssh-add Hadoop.pem
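With the key loaded into the agent, the master can reach the slaves without a password, which the start-dfs.sh/start-yarn.sh scripts below rely on. A quick check (hostname is a placeholder):

    ssh ec2-user@SLAVE_PRIVATE_DNS hostname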

2. Mount the additional drives (copy the repo's Mount.sh script here first; a sketch of what it might look like follows):

    bash Mount.sh
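Mount.sh itself is not reproduced in this README; a minimal sketch of what such a script might do, assuming two ephemeral volumes striped into a RAID0 array at /mnt/raid (the path used later on). The device names are assumptions and vary by instance type:

    #!/bin/bash
    # Hypothetical sketch of Mount.sh: stripe two ephemeral volumes into a
    # RAID0 array and mount it at /mnt/raid.
    # /dev/xvdb and /dev/xvdc are assumptions; check lsblk on your instance.
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
    sudo mkfs.ext4 /dev/md0
    sudo mkdir -p /mnt/raid
    sudo mount /dev/md0 /mnt/raid
    sudo chown ec2-user:ec2-user /mnt/raid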

3. Format the Hadoop namenode:

    hadoop namenode -format

4. Start all the services from hadoop/sbin:

    ./hadoop/sbin/start-dfs.sh
    ./hadoop/sbin/start-yarn.sh

-> In case the nodemanager won't start, start it manually:

    ./hadoop/sbin/yarn-daemon.sh start nodemanager

-> Check on both the master and the slaves for the actively running services: jps

-> Check the HDFS web UI for the total number of running datanodes and their status: MASTER_PUBLIC_DNS:50070
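For reference, a healthy cluster typically reports something like the following from jps (process IDs will differ):

    $ jps        # on the master
    2101 NameNode
    2315 SecondaryNameNode
    2502 ResourceManager
    2688 Jps

    $ jps        # on a slave
    1984 DataNode
    2120 NodeManager
    2233 Jps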

5. Write your Java file, compile it, and pack the class files into a jar:

    hadoop com.sun.tools.javac.Main YOUR_FILE_NAME.java
    jar cf JAR_NAME.jar YOUR_FILE_NAME*.class
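For example, matching the jar and class names used in the run command below (assuming the source file is named Terasort.java):

    hadoop com.sun.tools.javac.Main Terasort.java
    jar cf ts.jar Terasort*.class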

6. Generate the data file with gensort:

    ./gensort -a LINES file_path/file_name
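gensort writes fixed 100-byte records, so the record count determines the data size; for example, roughly 1 GB of ASCII input (path is a placeholder):

    ./gensort -a 10000000 /mnt/raid/input    # 10,000,000 records x 100 B ≈ 1 GB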

7. Put the generated data on the Hadoop file system:

    hadoop fs -mkdir /user
    hadoop fs -mkdir /user/ec2-user
    hadoop fs -mkdir /user/ec2-user/input
    hadoop fs -put file_path/file_name hadoop_input_path
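To confirm the upload landed where expected:

    hadoop fs -ls /user/ec2-user/input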

8. Run the program:

    hadoop jar ts.jar Terasort input_path output_path

9. Copy the output from the Hadoop file system to the local file system and rename it for validation:

    hadoop fs -get output/ /PATH
    mv generated_outputname output_file_name

10. Convert the output file to DOS format (install unix2dos first if needed):

    sudo yum install unix2dos
    unix2dos output_file_name

11. Validate it with valsort:

    ./valsort output_file_name
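On a correctly sorted file, valsort reports the record count, a checksum, and a success message, along these lines (values here are illustrative and exact fields vary by version):

    Records: 10000000
    Checksum: 4c4b3d2a1f0e9d
    Duplicate keys: 0
    SUCCESS - all records are in order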

### Spark Steps

1. By default the drive is mounted at /media/ephemeral0; I preferred /mnt/raid, so make the following change (copy the same Mount.sh as above here and run it):

    bash Mount.sh

2. Put the Python program in the same folder as the pyspark shell.

3. Slave nodes come with the drive already mounted. To run the program, go to the following path:

    cd spark/
    ./pyspark

  4. Type "Python program_name.py" ; it will generate output.

5. To check the output data:

    ./hadoop fs -ls /user/root/output/

6. To fetch data from the output folder, run the following command:

    ./hadoop fs -get /user/root/output/File_number /mnt/raid/

-> This fetches the data from the Hadoop file system and stores it on the drive we mounted at /mnt/raid.
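If the job produced several part files, they can also be pulled down as a single local file with getmerge (paths as above):

    ./hadoop fs -getmerge /user/root/output /mnt/raid/output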

### Shared Memory/External Memory

Data set generation: generate the data with gensort at location /mnt/raid with the file name "input", as shown below.
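Concretely, reusing the gensort invocation from the Hadoop section (the record count LINES is a placeholder):

    ./gensort -a LINES /mnt/raid/input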

1. Run the program:

    python Shared_memory_Merge_Sort.py

2. Fetch the output file from /mnt/raid/ (filename: output).

3. Convert the output file to DOS format:

    unix2dos /mnt/raid/output

4. Check it with valsort:

    ./valsort /mnt/raid/output
