The aim of this project is to build a cluster of at least 4 machines. We first build it virtually using VirtualBox, then on physical machines. The project uses Hadoop 3.3.3 and Ubuntu Server 20.04.
If ssh, rsync and Java are not yet installed, they can be installed with the following commands on Ubuntu:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
$ sudo apt-get install default-jdk
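The JDK install path is needed later for JAVA_HOME; on Ubuntu it can be looked up, for example, with:
$ readlink -f /usr/bin/javac
This prints a path such as /usr/lib/jvm/java-11-openjdk-amd64/bin/javac; everything before /bin/javac is the JAVA_HOME value used further below.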
Now that ssh is installed, we create a user named hadoop that will later install and run the HDFS cluster and the MapReduce jobs:
$ sudo adduser hadoop
$ sudo usermod -aG sudo hadoop
$ sudo su - hadoop
Once the user is created, we open a shell for it, create an SSH key pair, append the public key to the file authorized_keys and check that we can log in to localhost via ssh without a password:
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
$ exit
Since we are using Ubuntu Server, we set a static IP address using netplan.
$ sudo netplan generate
$ sudo vi /etc/netplan/00-installer-config.yaml
network:
  ethernets:
    enp0s3:
      dhcp4: no
      addresses: [192.168.1.100/24]
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
  version: 2
Then apply the changes:
$ sudo netplan --debug apply
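To confirm the new address and default route were applied (interface name enp0s3 as in the example above):
$ ip addr show enp0s3
$ ip route show default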
Set the hostname
$ sudo vi /etc/hostname
hadoop-master    # on the master node; use the corresponding name on each of the other nodes
Edit the hosts file under /etc/ on all nodes, listing the IP address of each system followed by its hostname:
$ sudo vi /etc/hosts
192.168.1.100 hadoop-master
192.168.1.101 hadoop-backup
192.168.1.102 hadoop-slave-1
192.168.1.103 hadoop-slave-2
192.168.1.104 hadoop-slave-3
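Hostname resolution can then be checked from any node, for example:
$ getent hosts hadoop-master
$ ping -c 1 hadoop-slave-1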
Set up the firewall. Allow SSH and traffic within the cluster subnet before enabling ufw so the current session is not cut off:
$ sudo ufw allow ssh
$ sudo ufw allow proto tcp from 192.168.1.0/24
$ sudo ufw allow proto tcp to 192.168.1.0/24
$ sudo ufw enable
$ sudo ufw status
Securely copy the SSH key to the other nodes so the hadoop user can log in to each of them without a password:
$ scp ~/.ssh/authorized_keys hadoop-backup:~/.ssh/authorized_keys
$ scp ~/.ssh/authorized_keys hadoop-slave-1:~/.ssh/authorized_keys
$ scp ~/.ssh/authorized_keys hadoop-slave-2:~/.ssh/authorized_keys
$ scp ~/.ssh/authorized_keys hadoop-slave-3:~/.ssh/authorized_keys
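Passwordless login can then be verified from the master, for example:
$ ssh hadoop-backup hostname
$ ssh hadoop-slave-1 hostname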
Having set up the basic environment, we can now download the Hadoop distribution and unpack it under /usr/local/:
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
$ shasum -a 512 hadoop-3.3.3.tar.gz
$ tar -xzvf hadoop-3.3.3.tar.gz
$ mv hadoop-3.3.3 hadoop
$ sudo mv hadoop /usr/local/
$ sudo chown hadoop:root -R /usr/local/hadoop/
$ sudo chmod g+rwx -R /usr/local/hadoop/
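The checksum printed by shasum can be compared against the value published by Apache; a sketch, assuming the matching .sha512 file is still on the same mirror (older releases are moved to archive.apache.org):
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz.sha512
$ cat hadoop-3.3.3.tar.gz.sha512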
Running Hadoop from the command line requires its environment variables to be set and the Hadoop binaries to be added to the PATH. Adding these lines to the file .bashrc saves retyping them every time.
$ vi ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
$ source ~/.bashrc
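After reloading .bashrc, the installation and the Java path can be checked, for example:
$ hadoop version
$ echo $JAVA_HOME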
Set up hadoop-env.sh and core-site.xml:
$ sudo vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
$ sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
hdfs-site.xml
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
$ sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
Format the file system (run this command on the master node only):
$ hdfs namenode -format
Start the HDFS daemons, the NameNode and the DataNode (run this command on the master node only):
$ start-dfs.sh
Browse the web interface for the NameNode; by default it is available at:
- NameNode - http://localhost:9870/
Start the ResourceManager and NodeManager daemons (run this command on the master node only):
$ start-yarn.sh
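By default the ResourceManager web interface is available at http://localhost:8088/. As a quick check, the JDK's jps tool should list the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager), and a small example job can be submitted; a sketch, assuming the examples jar bundled with this version:
$ jps
$ hdfs dfs -mkdir -p /user/hadoop
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar pi 2 10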
### Distributed system
Set up hadoop-env.sh and core-site.xml (these files are the same on all nodes):
$ sudo vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
$ sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
</configuration>
hdfs-site.xml
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- NameNode metadata directory -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/hdfs/name</value>
    </property>
    <!-- NameNode edit log directory -->
    <property>
        <name>dfs.namenode.edits.dir</name>
        <value>file:///home/hadoop/hdfs/edits</value>
    </property>
    <!-- SecondaryNameNode HTTP address (port 50090) -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop-backup:50090</value>
    </property>
</configuration>
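The name and edits directories referenced above must exist on the master node and be writable by the hadoop user; for example:
$ mkdir -p /home/hadoop/hdfs/name /home/hadoop/hdfs/edits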
$ sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
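With the defaults, each NodeManager looks for the ResourceManager on its own machine, so in a multi-node cluster the ResourceManager address normally also has to be set; a sketch of the extra property (added inside the same <configuration> element, assuming the ResourceManager runs on hadoop-master):
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
    </property>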
$ sudo vi $HADOOP_HOME/etc/hadoop/workers
hadoop-slave-1
hadoop-slave-2
hadoop-slave-3
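The files core-site.xml, mapred-site.xml, yarn-site.xml, hadoop-env.sh and workers are typically identical on every node, so instead of editing them by hand on each machine they can be pushed from the master with rsync (installed earlier); a sketch, assuming the same install path everywhere (hdfs-site.xml is still adjusted per role in the steps below):
$ cd $HADOOP_HOME/etc/hadoop
$ for node in hadoop-backup hadoop-slave-1 hadoop-slave-2 hadoop-slave-3; do rsync -av core-site.xml mapred-site.xml yarn-site.xml hadoop-env.sh workers $node:$HADOOP_HOME/etc/hadoop/; done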
Next, on hadoop-backup, create the checkpoint directory for the SecondaryNameNode and point hdfs-site.xml at it:
$ ssh hadoop-backup
$ mkdir -p /home/hadoop/hdfs/namesecondary
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///home/hadoop/hdfs/namesecondary</value>
    </property>
</configuration>
To move from hadoop-backup to hadoop-slave-1, log out of hadoop-backup and log in to hadoop-slave-1:
$ exit
$ ssh hadoop-slave-1
$ mkdir -p /home/hadoop/hdfs/datanode
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hdfs/datanode</value>
    </property>
</configuration>
Repeat the same steps on the other workers (hadoop-slave-2 and hadoop-slave-3). Then return to the master node and format the file system:
$ exit
$ hdfs namenode -format
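After formatting, the daemons can be started from the master exactly as in the single-node setup, and the cluster state inspected, for example:
$ start-dfs.sh
$ start-yarn.sh
$ hdfs dfsadmin -report
$ yarn node -list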