Introductions & Overview

We will set up the cluster using Cloudera Manager.
Note: Cloudera cluster setup requires 64-bit machines.

Selecting Hardware for Your CDH Cluster
Linux configuration/prechecks
Install package repositories for Cloudera Manager and CDH
Install a MySQL server for CM
Install Cloudera Manager and CDH
Benchmarking
Kerberize the cluster

Selecting Hardware for Your CDH Cluster

Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:

12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-512GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)

Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:

4–6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for Journal node)
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-128GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet

Linux configuration/prechecks

Before setting up the cluster, we need to configure the nodes. Follow the steps below on all nodes.

Step 1: Hostname Resolution, DNS and FQDNs

Set hostname

# vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=ip-172-31-250-81.cn-north-1.compute.internal
NETWORKING_IPV6=no
NOZEROCONF=yes

If you use /etc/hosts, ensure that the entries are listed in the appropriate order.
- On each line, the FQDN must be listed first, before any short name
- The IP address 127.0.0.1 must resolve to localhost

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.31.250.81 ip-172-31-250-81.cn-north-1.compute.internal ip-172-31-250-81
172.31.250.119 ip-172-31-250-119.cn-north-1.compute.internal ip-172-31-250-119
172.31.250.120 ip-172-31-250-120.cn-north-1.compute.internal ip-172-31-250-120
172.31.250.121 ip-172-31-250-121.cn-north-1.compute.internal ip-172-31-250-121
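
The ordering rule can be checked mechanically. The sketch below is only an illustration: check_hosts_order is a hypothetical helper name, and it assumes that "FQDN" can be approximated as "the first name after the IP address contains a dot". Loopback lines are skipped.

```shell
# check_hosts_order [FILE]: print non-loopback entries whose first name is
# not an FQDN (approximated here as "contains a dot").
# Hypothetical helper; returns non-zero if any such entry is found.
# Usage: check_hosts_order /etc/hosts
check_hosts_order() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }              # skip blank lines and comments
        $1 ~ /^127\./ || $1 == "::1" { next }      # loopback entries are exempt
        $2 !~ /\./ { print "short name listed first: " $0; bad = 1 }
        END { exit bad + 0 }
    ' "${1:-/etc/hosts}"
}
```

Run it against a copy of the file before distributing it to the other nodes; no output and a zero exit status mean every entry lists a dotted name first.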

Test proper resolution

# python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'

Enable the name server cache daemon (nscd) service
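
The step above names the service but not the commands. A minimal sketch, assuming the stock nscd package and the same yum/chkconfig/service tooling used elsewhere in this guide, might look like the following. The run wrapper and the DRY_RUN variable are hypothetical conveniences, not part of any tool: by default the commands are only printed.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root to actually apply them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

enable_nscd() {
    run yum install -y nscd     # install the daemon
    run chkconfig nscd on       # start it at boot
    run service nscd start      # start it now
}

enable_nscd
```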

Step 2: Sync all the nodes with a time source using NTP (Network Time Protocol)

Follow the steps documented here.

Step 3: Give one user passwordless sudo rights, to be used later for SSH (for example, hypers)

# vim /etc/sudoers
#Add the below line, replacing dummyuser with the chosen user name:
dummyuser ALL=(ALL) NOPASSWD:ALL

Step 4: Set IPTables to off

# /etc/init.d/iptables save
# /etc/init.d/iptables stop
# chkconfig iptables off

Step 5: Set IPv6 to disabled

# vim /etc/sysctl.conf
#Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

# vim /etc/sysconfig/network-scripts/ifcfg-eth0
NETWORKING_IPV6=no
IPV6INIT=no

Step 6: Set SELinux to disabled

# setenforce 0
# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
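
setenforce 0 only switches the running system to permissive mode; the config file decides what happens after a reboot. A small sketch for reading the configured mode back (selinux_configured_mode is a hypothetical helper name):

```shell
# selinux_configured_mode [FILE]: print the value of the SELINUX= line.
# The pattern is anchored to "SELINUX=" so that comment lines and the
# SELINUXTYPE= line are ignored.
# Usage: selinux_configured_mode /etc/selinux/config
selinux_configured_mode() {
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/selinux/config}"
}
```

After the edit above it should print disabled.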

Step 7: Set swappiness (vm.swappiness) to 0

# sysctl vm.swappiness=0
# echo "vm.swappiness = 0" >> /etc/sysctl.conf
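
To confirm that a kernel parameter set in this section (vm.swappiness here, or the IPv6 flags earlier) is actually in effect, the value can be read back from procfs. This is a sketch only; sysctl_get and check_sysctl are hypothetical helper names, and the optional ROOT argument exists purely so the helpers can be pointed at a test directory.

```shell
# sysctl_get KEY [ROOT]: read a kernel parameter from procfs.
# Dots in KEY map to directory separators under ROOT (default /proc/sys).
sysctl_get() {
    cat "${2:-/proc/sys}/$(printf '%s' "$1" | tr . /)"
}

# check_sysctl KEY EXPECTED [ROOT]: report whether KEY has the expected value.
# Returns 0 on match, 1 on mismatch, 2 if the key cannot be read.
# Usage: check_sysctl vm.swappiness 0
check_sysctl() {
    actual=$(sysctl_get "$1" "$3") || return 2
    if [ "$actual" = "$2" ]; then
        echo "OK   $1 = $actual"
    else
        echo "FAIL $1 = $actual (expected $2)"
        return 1
    fi
}
```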

Step 8: Set Transparent Huge Pages (THP) to off

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# echo "echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag" >> /etc/rc.local
# echo "echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled" >> /etc/rc.local
# echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local
# echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
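
The THP control files mark the active setting with square brackets, for example "always madvise [never]". Only one of the two directories used above normally exists on a given kernel, so a quick check can try both. A minimal sketch (thp_active is a hypothetical helper name):

```shell
# thp_active FILE: print the bracketed (active) value from a THP control file,
# e.g. "always madvise [never]" -> "never".
thp_active() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Report the active setting for whichever path exists on this system.
for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/enabled; do
    if [ -r "$f" ]; then echo "$f: $(thp_active "$f")"; fi
done
```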

Step 9: Raise the global open-file limits to at least 64k

# vim /etc/security/limits.conf
#Add the below lines:
*   soft    nofile 655350
*   hard    nofile 655350

Step 10: Set noatime on your supplementary volumes

Set it via mount option in /etc/fstab

# vim /etc/fstab
/dev/sdb1 /data1    ext4    defaults,noatime       0 0

Step 11: Set the reserve space for your supplementary volumes to 0

Reserved space is a filesystem property rather than a mount option. On ext filesystems, set it with tune2fs

# tune2fs -m 0 /dev/sdb1

Step 12: (Optional) If you are doing a lot of streaming, set vm.overcommit_memory kernel parameter to 1

# sysctl vm.overcommit_memory=1
# echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf

Step 13: Restart the network

# /etc/init.d/network restart
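
Once the steps above are applied, the noatime setting is easy to get wrong on one of many data volumes. The sketch below lists mount points under a given prefix whose fstab options lack noatime; missing_noatime is a hypothetical helper name, and the /data prefix in the usage line is an assumption based on the /data1 example.

```shell
# missing_noatime FSTAB PREFIX: print mount points that start with PREFIX
# and whose option list (4th field) does not contain noatime.
# Usage: missing_noatime /etc/fstab /data
missing_noatime() {
    awk -v prefix="$2" '
        NF == 0 || $1 ~ /^#/ { next }     # skip blank lines and comments
        index($2, prefix) == 1 && ("," $4 ",") !~ /,noatime,/ { print $2 }
    ' "$1"
}
```

Empty output means every matching volume already carries the option.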

Install package repositories for Cloudera Manager and CDH

Step 1: Installing and Starting Apache HTTPD

# yum install httpd -y
# chkconfig httpd on
# service httpd start

Step 2: Download Tarball of CM

# cd /var/www/html/
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/CDH-5.5.0/cm5.5.0-centos6.tar.gz
# tar xzvf cm5.5.0-centos6.tar.gz
# chmod -R ugo+rX /var/www/html/cm
# rm -rf cm5.5.0-centos6.tar.gz

Step 3: Download Parcel of CDH

# cd /var/www/html/
# mkdir CDH-5.5.0
# cd CDH-5.5.0/
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/CDH-5.5.0/CDH-5.5.0-1.cdh5.5.0.p0.8-el6.parcel
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/CDH-5.5.0/CDH-5.5.0-1.cdh5.5.0.p0.8-el6.parcel.sha1
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/CDH-5.5.0/manifest.json
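
Since the parcel ships with a .sha1 file, it is worth verifying the download before serving it to the cluster. This is a sketch under one assumption: that the .sha1 file holds the hex digest as its first whitespace-separated field. verify_parcel is a hypothetical helper name.

```shell
# verify_parcel PARCEL: compare the SHA-1 of PARCEL with the digest stored
# in PARCEL.sha1. Returns 0 on match, 1 on mismatch, 2 if .sha1 is unreadable.
# Usage: verify_parcel CDH-5.5.0-1.cdh5.5.0.p0.8-el6.parcel
verify_parcel() {
    expected=$(awk 'NR == 1 { print $1 }' "$1.sha1") || return 2
    actual=$(sha1sum "$1" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}
```

The same check applies to the KAFKA parcel downloaded in the next step.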

Step 4: Download Parcel of KAFKA

# cd /var/www/html/
# mkdir KAFKA
# cd KAFKA/
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/Kafka/KAFKA-0.8.2.0-1.kafka1.3.2.p0.15-el6.parcel
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/Kafka/KAFKA-0.8.2.0-1.kafka1.3.2.p0.15-el6.parcel.sha1
# wget https://s3.cn-north-1.amazonaws.com.cn/hypers/cdh/Kafka/manifest.json

Step 5: Enable all nodes to find the packages that you are hosting (follow the steps below on all nodes)

Replace ip-172-31-250-81.cn-north-1.compute.internal with your local repository's hostname.

echo "[cloudera-manager]" > /etc/yum.repos.d/cloudera-manager.repo
echo "# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64" >> /etc/yum.repos.d/cloudera-manager.repo
echo "name=Cloudera Manager" >> /etc/yum.repos.d/cloudera-manager.repo
echo "baseurl = http://ip-172-31-250-81.cn-north-1.compute.internal/cm/5/" >> /etc/yum.repos.d/cloudera-manager.repo
echo "gpgkey = http://ip-172-31-250-81.cn-north-1.compute.internal/cm/RPM-GPG-KEY-cloudera" >> /etc/yum.repos.d/cloudera-manager.repo
echo "gpgcheck = 1" >> /etc/yum.repos.d/cloudera-manager.repo
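
Because the hostname appears twice in the file, it can be less error-prone to generate the repo definition from a single argument. A minimal sketch that produces the same six lines as the echo commands above (write_cm_repo is a hypothetical helper name):

```shell
# write_cm_repo HOST: print a cloudera-manager.repo definition that points
# at the local mirror on HOST.
write_cm_repo() {
    cat <<EOF
[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera Manager
baseurl = http://$1/cm/5/
gpgkey = http://$1/cm/RPM-GPG-KEY-cloudera
gpgcheck = 1
EOF
}

# Print the definition for the example repository host used in this guide.
write_cm_repo ip-172-31-250-81.cn-north-1.compute.internal
```

Redirect the output to /etc/yum.repos.d/cloudera-manager.repo as root once it looks right.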

Install a MySQL server for CM

Follow the steps documented here.

Install Cloudera Manager and CDH

Step 1: Set up a Database for the Cloudera Manager Server

mysql -uroot --password='gurutechhypers' -h cdh01.hypers.com.cn
    GRANT ALL PRIVILEGES ON scm.* to 'scm'@'%' IDENTIFIED BY 'xRoYuK8ajV';
    flush privileges;
    exit;

Step 2: Set up an external database and pre-create the schemas needed for your deployment

mysql> create database database DEFAULT CHARACTER SET utf8;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on database.* TO 'user'@'%' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.00 sec)

database, user, and password can be any value. The examples match the default names provided in the Cloudera Manager configuration settings:

Role	Database	User	Password
Activity Monitor	amon	amon	amon_password
Reports Manager	rman	rman	rman_password
Hive Metastore Server	metastore	hive	hive_password
Sentry Server	sentry	sentry	sentry_password
Cloudera Navigator Audit Server	nav	nav	nav_password
Cloudera Navigator Metadata Server	navms	navms	navms_password
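
Rather than typing the two statements once per role, the table can drive a small generator. The sketch below only prints SQL; gen_cm_sql is a hypothetical helper name, and the passwords are the table's placeholders, which should be replaced before use.

```shell
# gen_cm_sql: emit CREATE DATABASE and GRANT statements for each
# database/user/password row of the table above.
gen_cm_sql() {
    while read -r db user pass; do
        echo "CREATE DATABASE $db DEFAULT CHARACTER SET utf8;"
        echo "GRANT ALL ON $db.* TO '$user'@'%' IDENTIFIED BY '$pass';"
    done <<EOF
amon amon amon_password
rman rman rman_password
metastore hive hive_password
sentry sentry sentry_password
nav nav nav_password
navms navms navms_password
EOF
}

gen_cm_sql
```

Review the output, then pipe it into the mysql client connected to your database host.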

Step 3: Install the Oracle JDK

Install the Oracle Java Development Kit (JDK) on the Cloudera Manager Server host

# yum install oracle-j2sdk1.7 -y

Step 4: Install the Cloudera Manager Server Packages

On the Cloudera Manager Server host, type the following commands to install the Cloudera Manager packages

# yum install cloudera-manager-daemons cloudera-manager-server -y

Once the packages are installed, prepare the Cloudera Manager Server database with the scm_prepare_database.sh script.

Running the script when MySQL is installed on another host
This example runs the script on the Cloudera Manager Server host and connects remotely to MySQL on the host given by -h, using the account given by -u and -p
On the Cloudera Manager Server host, run the script
/usr/share/cmf/schema/scm_prepare_database.sh mysql -h cdh01.hypers.com.cn -uroot -pgurutechhypers --scm-host cdh01.hypers.com.cn scm scm xRoYuK8ajV

Step 5: Start the Cloudera Manager Server

# chkconfig cloudera-scm-server on
# service cloudera-scm-server start

Step 6: Start and Log into the Cloudera Manager Admin Console

In a web browser, enter http://<Server host>:7180
Log into Cloudera Manager Admin Console. The default credentials are: Username: admin Password: admin
After logging in, the Cloudera Manager End User License Terms and Conditions page displays. Read the terms and conditions and then select Yes to accept them
Click Continue

Step 7: Choose Cloudera Manager Edition and Hosts

When you start the Cloudera Manager Admin Console, the install wizard starts up. Click Continue to get started
Choose which edition to install
(Optional) If you elect Cloudera Enterprise, install a license
Click Continue to proceed with the installation
Enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges. Click Search
Click Continue. The Select Repository screen displays

Step 8: Choose the Software Installation Type and Install Software

Parcel Repository - In the Remote Parcel Repository URLs field, click the + button and enter the URL of the repository
- replace ip-172-31-250-81.cn-north-1.compute.internal with your local repository's hostname
- http://ip-172-31-250-81.cn-north-1.compute.internal/CDH-5.5.0
- http://ip-172-31-250-81.cn-north-1.compute.internal/KAFKA
Select the release of Cloudera Manager Agent. You can choose either the version that matches the Cloudera Manager Server you are currently using or specify a version in a custom repository. If you opted to use custom repositories for installation files, you can provide a GPG key URL that applies for all repositories. Click Continue
- replace ip-172-31-250-81.cn-north-1.compute.internal with your local repository's hostname
- http://ip-172-31-250-81.cn-north-1.compute.internal/cm/5/
- http://ip-172-31-250-81.cn-north-1.compute.internal/cm/RPM-GPG-KEY-cloudera
Select the Install Oracle Java SE Development Kit (JDK) checkbox to allow Cloudera Manager to install the JDK on each cluster host, or leave it deselected if you have already installed the JDK. If you selected it, your local laws permit you to deploy unlimited strength encryption, and you are running a secure cluster, also select the Install Java Unlimited Strength Encryption Policy Files checkbox. Click Continue
Do NOT use single user mode when asked. Click Continue
If you chose to have Cloudera Manager install software, specify host installation properties
- Select root or enter the user name for an account that has password-less sudo permission
- Select an authentication method
Click Continue. When the Continue button at the bottom of the screen turns blue, the installation process is completed
Click Continue. The Host Inspector runs to validate the installation and provides a summary of what it finds, including all the versions of the installed components. If the validation is successful, click Finish

Step 9: Add Services

In the first page of the Add Services wizard, choose the combination of services to install and whether to install Cloudera Navigator. Click Continue
Customize the assignment of role instances to hosts. When you are satisfied with the assignments, click Continue
On the Database Setup page, configure settings for required databases. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the information you have supplied. If the test succeeds in all cases, click Continue
Review the configuration changes to be applied. Confirm the settings entered for file system paths. Click Continue. The wizard starts the services
When all of the services are started, click Continue
Click Finish to proceed to the Cloudera Manager Admin Console Home Page

Step 10: Change the Default Administrator Password

Right-click the logged-in username at the far right of the top navigation bar and select Change Password
Enter the current password and a new password twice, and then click Update

Benchmarking

Michael G. Noll's blog post reviews many of these benchmark tools.

Step 1: Running a MapReduce Job

Parcel - sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
Package - sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

Step 2: TeraSort benchmark suite

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000 /user/hdfs/terasort-input
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /user/hdfs/terasort-input /user/hdfs/terasort-output
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/terasort-output /user/hdfs/terasort-validate
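
The first argument to teragen is a row count, not a byte count. TeraGen rows are 100 bytes each, so the 1000000 rows above amount to roughly 100 MB, which is small for a benchmark. A sketch for sizing the input (teragen_rows is a hypothetical helper name):

```shell
# teragen_rows BYTES: number of 100-byte TeraGen rows needed for BYTES of input.
teragen_rows() {
    echo $(( $1 / 100 ))
}

teragen_rows 100000000        # 100 MB -> 1000000, the row count used above
teragen_rows 1000000000000    # 1 TB   -> 10000000000
```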

Step 3: NameNode benchmark (nnbench)

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-test-2.6.0-mr1-cdh5.5.0.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench

Kerberize the cluster

Integrating Cloudera Manager with MIT Kerberos is documented here.
Integrating Cloudera Manager with FreeIPA, written by Smoak.Wu of Hypers, is documented here.
