Hadoop

Hadoop in a Docker container

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 7180 4239cd2958c6 /usr/bin/docker-quickstart
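
If Cloudera Manager should be reachable on a fixed host port ( an assumption, not part of the original command ), the quickstart container can be started with an explicit port mapping:

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 7180:7180 4239cd2958c6 /usr/bin/docker-quickstart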

Cloudera run:

HDFS common commands

help ( Distributed File System )

hdfs dfs -help
hdfs dfs -help copyFromLocal
hdfs dfs -help ls
hdfs dfs -help cat
hdfs dfs -help setrep

list files

hdfs dfs -ls /user/root/input
hdfs dfs -ls hdfs://hadoop-local:9000/data

output example:

-rw-r--r--   1 root supergroup       5107 2017-10-27 12:57 hdfs://hadoop-local:9000/data/Iris.csv
             ^ replication factor

change replication factor

hdfs dfs -setrep -w 4 /data/file.txt
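
A quick way to verify the new replication factor ( a sketch; %r in the -stat format prints the replication ):

hdfs dfs -stat "replication: %r" /data/file.txt
hdfs dfs -ls /data/file.txt     # the second column also shows the replication factor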

create folder

hdfs dfs -mkdir /data 

copy files from local filesystem to remote

hdfs dfs -put /home/root/tmp/Iris.csv /data/
hdfs dfs -copyFromLocal /home/root/tmp/Iris.csv /data/

copy files from local filesystem to remote with replication factor

hdfs dfs -Ddfs.replication=2 -put /path/to/local/file /path/to/hdfs

copy within HDFS ( small files only !!! data is streamed through the client: read from DataNodes and written back to DataNodes !!! )

hdfs dfs -cp /home/root/tmp/Iris.csv /data/

distributed copy ( does not use the client as a pipe, runs as a MapReduce job )

hadoop distcp /home/root/tmp/Iris.csv /data/
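
A sketch of a typical distcp use case, copying between clusters ( hostnames and paths are placeholders ):

hadoop distcp hdfs://namenode1:8020/data hdfs://namenode2:8020/backup/data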

read data from DataNode

hdfs dfs -get /path/to/hdfs /path/to/local/file
hdfs dfs -copyToLocal /path/to/hdfs /path/to/local/file

remove data from HDFS ( moved to Trash !!! separate Trash for each user )

hdfs dfs -rm -r /path/to/hdfs-folder

remove data from HDFS

hdfs dfs -rm -r -skipTrash /path/to/hdfs-folder

clean up trash bin

hdfs dfs -expunge

file info ( disk usage )

hdfs dfs -du -h /path/to/hdfs-folder

does the file/folder exist ?

hdfs dfs -test -e /path/to/hdfs-folder
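
A minimal usage sketch ( -e checks existence, -d checks for a directory; the answer is returned as the exit code ):

hdfs dfs -test -e /data/Iris.csv && echo "exists" || echo "missing"
hdfs dfs -test -d /data ; echo $?     # 0 if /data is a directory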

list of files ( / - root )

hdfs dfs -ls /
hdfs dfs -ls hdfs://192.168.1.10:8020/path/to/folder

the same as previous, assuming fs.defaultFS = hdfs://192.168.1.10:8020 ( the deprecated property name is fs.default.name )

hdfs dfs -ls /path/to/folder
hdfs dfs -ls file:///local/path   ==   (ls /local/path)

show all sub-folders

hdfs dfs -ls -R /path/to/folder

standard commands for hdfs

-touchz, -cat (-text), -tail, -mkdir, -chmod, -chown, -count ....
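
A short sketch with a few of them ( paths are examples only ):

hdfs dfs -touchz /data/empty.txt                  # create an empty file
hdfs dfs -chmod 644 /data/empty.txt               # change permissions
hdfs dfs -chown root:supergroup /data/empty.txt   # change owner and group
hdfs dfs -count -h /data                          # directories, files, bytes
hdfs dfs -tail /data/Iris.csv                     # last kilobyte of the file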

Hadoop governance, administration

filesystem capacity

hdfs dfs -df -h

file system check, reporting, file system information

hdfs fsck /
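
A more detailed report can be requested ( a sketch; -files, -blocks and -locations are standard fsck options ):

hdfs fsck /data -files -blocks -locations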

balancer for the distributed file system, necessary after a DataNode fails or is removed

hdfs balancer
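
A sketch with an explicit threshold ( allowed percentage of disk-usage deviation between DataNodes ):

hdfs balancer -threshold 5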

administration of the filesystem

hdfs dfsadmin -help

show statistics

hdfs dfsadmin -report

HDFS to "read-only" mode for external users

hdfs dfsadmin -safemode enter | leave | get
hdfs dfsadmin -upgrade
hdfs dfsadmin -backup
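
A minimal safe-mode sketch ( enter before maintenance, check the state, leave afterwards ):

hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode get     # Safe mode is ON
hdfs dfsadmin -safemode leave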

job execution

hadoop jar {path to jar} {classname}
yarn jar {path to jar} {classname}
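
A sketch of a typical run using the bundled MapReduce examples jar ( the jar path is distribution-specific and assumed here for an HDP sandbox ):

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /data/input /data/output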

Hortonworks sandbox

tutorial.credentials

Web SSH

localhost:4200
root/hadoop

SSH access

ssh root@localhost -p 2222

ambari password reset

  • shell web client (aka shell-in-a-box): localhost:4200 root / hadoop
  • ambari-admin-password-reset
  • ambari-agent restart
  • login into ambari: localhost:8080 admin/{your password}

Zeppelin UI

http://localhost:9995

install jupyter for spark

https://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/

SPARK_MAJOR_VERSION is set to 2, using Spark2
Error in pyspark startup:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
a workaround is to force Spark 1 inside the script: SPARK_MAJOR_VERSION=1
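
Alternatively, following the error message above, pyspark for Spark 2+ can be pointed at Jupyter instead of downgrading ( the notebook options are assumptions ):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889"
pyspark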