titanic_data_analysis

The sinking of the Titanic has prompted many inquiries and many kinds of analysis. This project is not about why the Titanic sank; it analyzes the publicly available Titanic passenger data. It uses Apache Pig in MapReduce mode to find the following results from that data:

  • The average age of the male and female passengers who died in the tragedy.

  • The number of passengers who survived, by travelling class.
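To make the two targets concrete, here is a small Python sketch that computes the same aggregates on a hypothetical six-row sample. It is only a local illustration, not the Pig scripts from this repository. The sample rows and the function names are made up for the example, and the 0/1 encoding follows the data set description in this README, so check it against the actual file.

```python
from collections import defaultdict

# Encoding as stated in this README's data set description;
# verify it against the actual titanicdata file before relying on it.
SURVIVED, DIED = 0, 1

# Hypothetical sample rows: (survived, pclass, sex, age); age None = missing.
SAMPLE = [
    (DIED, 3, "male", 22.0),
    (SURVIVED, 1, "female", 38.0),
    (SURVIVED, 3, "female", 26.0),
    (DIED, 3, "male", None),
    (DIED, 1, "male", 54.0),
    (DIED, 3, "female", 18.0),
]

def average_age_of_dead_by_sex(rows):
    """Mean age per sex among passengers who died, skipping missing ages."""
    totals = defaultdict(lambda: [0.0, 0])  # sex -> [sum of ages, count]
    for survived, _pclass, sex, age in rows:
        if survived == DIED and age is not None:
            totals[sex][0] += age
            totals[sex][1] += 1
    return {sex: total / count for sex, (total, count) in totals.items()}

def survivors_by_class(rows):
    """Number of survivors per travelling class."""
    counts = defaultdict(int)
    for survived, pclass, _sex, _age in rows:
        if survived == SURVIVED:
            counts[pclass] += 1
    return dict(counts)

print(average_age_of_dead_by_sex(SAMPLE))
print(survivors_by_class(SAMPLE))
```

Rows with a missing age are left out of the average rather than counted as zero, which would otherwise pull the mean down.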

DATA SET DESCRIPTION

Column 1 : PassengerId

Column 2 : Survived (survived=0 & died=1)

Column 3 : Pclass

Column 4 : Name

Column 5 : Sex

Column 6 : Age

Column 7 : SibSp

Column 8 : Parch

Column 9 : Ticket

Column 10 : Fare

Column 11 : Cabin

Column 12 : Embarked
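As a rough illustration of the layout above, the following Python sketch maps one line of the file to named fields. It assumes the file is comma-delimited with no header row and that names containing commas are quoted; none of that is confirmed here, so check the titanicdata file first. The example line and the `parse_row` helper are hypothetical.

```python
import csv
import io

# Column names in file order, taken from the list above.
COLUMNS = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
           "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

def parse_row(line):
    """Parse one comma-separated line into a dict keyed by column name."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # Convert the columns used by the two reports; an empty Age becomes None.
    row["Survived"] = int(row["Survived"])
    row["Pclass"] = int(row["Pclass"])
    row["Age"] = float(row["Age"]) if row["Age"] else None
    return row

# Hypothetical line for illustration, not copied from titanicdata.
example = '1,1,3,"Doe, Mr. John",male,22,1,0,A/5 0000,7.25,,S'
print(parse_row(example))
```

The sketch uses the `csv` module instead of a plain `split(",")` so that a quoted name with a comma in it stays in one field.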

The Titanic data set is provided in this repository in the titanicdata file. Store the file in HDFS, in the directory named in the Pig script, and then execute the script. Run the following command to compute and store the average age of the passengers who survived and of those who died:

$ pig -x mapreduce '/home/maniram/data/titanic_avg_age_report.pig'
-- make sure you have changed the file path in the line above to match your local system

To generate the report of passengers who died and who survived, run the second Pig script with the following command:

$ pig -x mapreduce '/home/maniram/data/titanic_pclass_report.pig'
-- make sure you have changed the file path in the line above to match your local system

After you run the Pig scripts above, the results are saved in HDFS and also printed in the terminal.

Make sure your Hadoop cluster is running before executing the Pig scripts.