To create a user and grant it sudo access
sudo adduser testuser
sudo adduser testuser sudo
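To verify the user exists and is in the sudo group (a standard id check, not part of the original steps):
id testuser #expect 'sudo' among the listed groups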
For the local host
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
For the other hosts
ssh-copy-id -i ~/.ssh/id_rsa.pub user@host
ssh user@host #verify that login now works without a password
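A quick way to confirm key-based login works without prompting, using OpenSSH's BatchMode option (an added verification step, not from the original):
ssh -o BatchMode=yes user@host true #fails immediately instead of falling back to a password prompt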
http://hadoop.apache.org/releases.html#Download
#Choose the right mirror, below link is for US machines.
wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xzf hadoop-2.7.3.tar.gz
export HADOOP_HOME=$HOME/hadoop-2.7.3
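To confirm the tarball unpacked correctly and the variable points at it (Hadoop's own version subcommand):
$HADOOP_HOME/bin/hadoop version #should report Hadoop 2.7.3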
Add the data nodes; do not add the master node.
vi $HADOOP_HOME/etc/hadoop/slaves
user@host1
user@host2
git clone https://github.com/kmadhugit/hadoop-cluster-utils.git
cd hadoop-cluster-utils
vi add-this-to-dot-profile.sh #set the correct paths in the environment variables
. add-this-to-dot-profile.sh
Check whether the cluster scripts are working:
AN hostname #should print the hostname of every node, master included
Update .bashrc
- Delete the following check (it makes non-interactive shells skip the rest of .bashrc, so the exported variables would never load over ssh):
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
- Append add-this-to-dot-profile.sh to the end of .bashrc:
vi $HOME/.bashrc
G #in vi: jump to the last line
:r $HOME/hadoop-cluster-utils/add-this-to-dot-profile.sh #read the script in below the cursor
set -o vi #(optional) vi-style command-line editing for the shell
- Copy .bashrc and the Hadoop tarball to all data nodes, then unpack the tarball there:
CP $HOME/.bashrc $HOME
CP $HOME/hadoop-2.7.3.tar.gz $HOME
DN "tar xf hadoop-2.7.3.tar.gz --gzip"
You need to modify two config files for HDFS
- core-site.xml #set the NameNode hostname (see the property sketch after these commands)
cd $HOME/hadoop-cluster-utils/conf
cp core-site.xml.template core-site.xml
vi core-site.xml
cp core-site.xml $HADOOP_HOME/etc/hadoop #local copy on the master
CP core-site.xml $HADOOP_HOME/etc/hadoop #push to all slaves
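A minimal sketch of the property being edited, assuming the template follows the stock Hadoop 2.x core-site.xml layout; namenode-host is a placeholder for your NameNode's hostname and 9000 is just a common port choice:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:9000</value> <!-- placeholders: your NameNode hostname and port -->
</property>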
- hdfs-site.xml
Create a local dir on the name node for the metadata:
mkdir -p /data/user/hdfs-meta-data
Create a local dir on all data nodes for the HDFS data:
DN "mkdir -p /data/user/hdfs-data"
Update the dir paths:
cd $HOME/hadoop-cluster-utils/conf
cp hdfs-site.xml.template hdfs-site.xml
vi hdfs-site.xml #update the dir paths (see the property sketch after these commands)
Copy the files to all nodes
cp hdfs-site.xml $HADOOP_HOME/etc/hadoop #local copy on the master
CP hdfs-site.xml $HADOOP_HOME/etc/hadoop #push to all slaves
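A minimal sketch of the two path properties, assuming the template uses the stock Hadoop 2.x names; the file:// URIs point at the directories created above:
<property>
  <name>dfs.namenode.name.dir</name> <!-- NameNode metadata dir -->
  <value>file:///data/user/hdfs-meta-data</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name> <!-- DataNode block storage dir -->
  <value>file:///data/user/hdfs-data</value>
</property>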
- Start HDFS as a fresh filesystem
$HADOOP_HOME/bin/hdfs namenode -format mycluster
start-hdfs.sh
AN jps
# use stop-hdfs.sh for stopping
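A quick smoke test with Hadoop's standard CLI (the test path is arbitrary, not from the original steps):
$HADOOP_HOME/bin/hdfs dfsadmin -report #every DataNode should be listed as live
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /tmp/smoke-test #create and list a throwaway dir
$HADOOP_HOME/bin/hdfs dfs -ls /tmp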
- Start HDFS on existing cluster data. To reuse already-created data, change the ownership of the data directories to your own user first:
AN "sudo chown user:user /data/hdfs-meta-data"
AN "sudo chown user:user /data/hdfs-data"
start-hdfs.sh
AN jps
Ensure that the following Java process is running on the master. If not, check the log files:
NameNode
Ensure that the following Java process is running on the slaves. If not, check the log files:
DataNode
- HDFS web address
http://localhost:50070
You need to modify two config files for YARN
- capacity-scheduler.xml #switch the resource-calculator property to DominantResourceCalculator, so the scheduler accounts for CPU (vcores) as well as memory
vi $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
- yarn-site.xml #modify the properties as described in the template (a sketch of typical entries follows these commands)
cd $HOME/hadoop-cluster-utils/conf
cp yarn-site.xml.template yarn-site.xml
vi yarn-site.xml
cp yarn-site.xml $HADOOP_HOME/etc/hadoop #local copy on the master
CP yarn-site.xml $HADOOP_HOME/etc/hadoop #push to all slaves
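A sketch of entries the template typically asks for, assuming the stock Hadoop 2.x property names; the hostname and sizes are placeholders to adapt to your machines:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>namenode-host</value> <!-- placeholder: the master's hostname -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- placeholder: memory per node available to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- placeholder: vcores per node available to containers -->
</property>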
- Start Yarn
start-yarn.sh
AN jps
Ensure that the following Java processes are running on the master. If not, check the log files:
JobHistoryServer
ResourceManager
Ensure that the following Java process is running on the slaves. If not, check the log files:
NodeManager
- Resource Manager and Node Manager web Address
Resource Manager : http://localhost:8088/cluster
Node Manager : http://datanode:8042/node (For each node)
> stop-all.sh #stop HDFS and Yarn
> start-all.sh #start HDFS and Yarn
> CP <localpath to file> <remotepath to dir> #copy a file from the name node to all slaves
> AN <command> #execute a given command on all nodes, including the master
> DN <command> #execute a given command on all nodes, excluding the master
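For example, using the helpers documented above (the file name and commands are only illustrative):
AN "df -h /data" #check free space under /data on every node
DN "jps" #list Java processes on the slaves only
CP $HOME/notes.txt $HOME #hypothetical file, pushed to every slave's home dir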
http://spark.apache.org/downloads.html
#Choose the right mirror, below link is for US machines.
wget http://www-us.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.7.tgz
tar -xzvf spark-2.0.1-bin-hadoop2.7.tgz
git clone https://github.com/apache/spark.git
git checkout -b v2.0.1 v2.0.1
export MAVEN_OPTS="-Xmx32G -XX:MaxPermSize=8G -XX:ReservedCodeCacheSize=2G" #large heap for the build
./build/mvn -T40 -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -DskipTests -Dmaven.javadoc.skip=true install #-T40 runs 40 parallel build threads
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark-2.0.1-bin-hadoop2.7
. ~/.bashrc
${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.0.1.jar 10
#Add in ~/.bashrc
export SPARK_HOME=$HOME/spark
. ~/.bashrc
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1024M --num-executors 2 --executor-memory 1g --executor-cores 1 /home/testuser/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.0.1.jar
- Enable event logging and additional settings by adding the following to $SPARK_HOME/conf/spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
spark.eventLog.compress true
spark.history.fs.logDirectory /tmp/spark-events
spark.serializer org.apache.spark.serializer.KryoSerializer
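Spark does not create the event log directory on its own, and jobs fail to start if it is missing, so create it up front; here using the AN helper from above to cover every node:
AN "mkdir -p /tmp/spark-events"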
The scripts below start/stop the following services in an automated way:
- NameNode daemon (only on the HDFS master)
- DataNode daemon (on all slave nodes)
- ResourceManager daemon (only on the YARN master)
- NodeManager daemon (on all slave nodes)
- JobHistoryServer (only on the YARN master)
- Spark history server (on the YARN master)
# Start
start-all.sh
# Stop
stop-all.sh
Common spark-submit options:
Option | Description |
---|---|
--num-executors | Total number of executor JVMs to spawn across the YARN cluster |
--executor-cores | Number of cores for each executor JVM |
--executor-memory | Memory to allocate to each executor JVM, e.g. 1024M or 1G |
--driver-memory | Memory to allocate to the driver JVM |
--driver-cores | Number of vcores for the driver JVM (cluster mode) |
--driver-java-options | Options passed to the driver JVM; useful in local mode for profiling |

Total vcores = num-executors * executor-cores + driver-cores
Total memory = num-executors * executor-memory + driver-memory
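A worked example of the arithmetic, with hypothetical sizes: 4 executors with 2 cores and 2G each, plus a 1-core, 1G driver:
spark-submit ... --num-executors 4 --executor-cores 2 --executor-memory 2G --driver-cores 1 --driver-memory 1G ...
# Total vcores = 4 * 2 + 1 = 9
# Total memory = 4 * 2G + 1G = 9G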