CS267-Project

Running K-means on Hadoop Cluster

  • Make sure that the Hadoop services (HDFS and YARN) are running; a quick check is shown below.
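A minimal sanity check, assuming a standard installation where the Hadoop daemons run as separate JVMs: jps should list NameNode, DataNode, ResourceManager and NodeManager, and the dfsadmin report should show live datanodes.

jps
hdfs dfsadmin -report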

Preparing Input Directories

hadoop fs -mkdir /mahout_data
hadoop fs -mkdir /kmeans_output
hadoop fs -mkdir /mahout_seq
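Equivalently, the three directories can be created in a single call, since hadoop fs -mkdir accepts the -p flag and multiple paths:

hadoop fs -mkdir -p /mahout_data /kmeans_output /mahout_seq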
  • Verify the created directories using:
hadoop fs -ls /

Copy the Input file to HDFS

hadoop fs -put ./keyVal.txt /mahout_data/
  • Note: a sample input file (sample.txt) is in the hadoop/data/ directory of this repo; adjust the file name in the command above if yours differs.
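To confirm the upload, list the directory and peek at the file (file name as used in the command above):

hadoop fs -ls /mahout_data
hadoop fs -cat /mahout_data/keyVal.txt | head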

Converting the text file to a sequence file

mahout seqdirectory \
-i /mahout_data \
-o /mahout_seq \
-ow
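To spot-check the sequence file, dump a few records with seqdumper (seqdirectory usually names its output chunk-0; adjust the file name if your Mahout version differs):

mahout seqdumper -i /mahout_seq/chunk-0 | head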
                    

Sequence file to sparse vector conversion

mahout seq2sparse -i /mahout_seq/ -o /mahout_sparse/ -ow
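seq2sparse writes several subdirectories under /mahout_sparse, including tf-vectors, tfidf-vectors and dictionary.file-0, which the following steps refer to. List them with:

hadoop fs -ls /mahout_sparse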

Perform Canopy Clustering

mahout canopy -i /mahout_sparse/tf-vectors -o /canopy_output/ \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 10 -t2 20 -ow
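To inspect the generated canopies, clusterdump can write them to a local file (a sketch; the clusters-0-final subdirectory name may differ across Mahout versions):

mahout clusterdump -i /canopy_output/clusters-0-final \
-d /mahout_sparse/dictionary.file-0 -dt sequencefile \
-o canopy_dump.txt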

Finally, run k-means using the Canopy output as the initial clusters

mahout kmeans -i /mahout_sparse/tfidf-vectors \
-c /canopy_output/clusters-0-final \
-o /kmeans_output \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 2 -ow
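To examine the final clusters and their top terms, run clusterdump on the last iteration's output (a sketch; the iteration number in clusters-N-final depends on when k-means stopped, here at most 2):

mahout clusterdump -i /kmeans_output/clusters-2-final \
-d /mahout_sparse/dictionary.file-0 -dt sequencefile \
-n 20 -o kmeans_dump.txt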

About

MapReduce K-Means on AWS EMR cluster
