Hadoop

Hadoop in a Docker container

  • MapR
  • Hortonworks
  • Cloudera

Cloudera run:

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 7180 4239cd2958c6 /usr/bin/docker-quickstart

IBM education container start

docker run -it --name bdu_spark2 -P -p 4040:4040 -p 4041:4041 -p 8080:8080 -p 8081:8081 bigdatauniversity/spark2:latest /etc/bootstrap.sh -bash


HDFS common commands

admin command, cluster settings

hdfs dfsadmin -report

list of namenodes, list of secondary namenodes

hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes
hdfs getconf -confKey dfs.namenode.name.dir

confKey:

  • dfs.namenode.name.dir
  • fs.defaultFS
  • yarn.resourcemanager.address
  • mapreduce.framework.name
  • dfs.default.chunk.view.size
  • dfs.namenode.fs-limits.max-blocks-per-file
  • dfs.permissions.enabled
  • dfs.namenode.acls.enabled
  • dfs.replication
  • dfs.replication.max
  • dfs.namenode.replication.min
  • dfs.blocksize
  • dfs.client.block.write.retries
  • dfs.hosts.exclude
  • dfs.namenode.checkpoint.edits.dir
  • dfs.image.compress
  • dfs.image.compression.codec
  • dfs.user.home.dir.prefix
  • io.file.buffer.size
  • io.bytes-per-checksum
  • io.seqfile.local.dir

help ( Distributed File System )

hdfs dfs -help
hdfs dfs -help copyFromLocal
hdfs dfs -help ls
hdfs dfs -help cat
hdfs dfs -help setrep

list files

hdfs dfs -ls /user/root/input
hdfs dfs -ls hdfs://hadoop-local:9000/data

output example:

-rw-r--r--   1 root supergroup       5107 2017-10-27 12:57 hdfs://hadoop-local:9000/data/Iris.csv
             ^ factor of replication

files count

hdfs dfs -count /user/root/input

Output columns: 1st - number of directories ( including the one given ), 2nd - number of files in the folder, 3rd - total size of the folder in bytes.
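A hypothetical output line ( directory count, file count, content size in bytes, path ):

           2           10              51070 /user/root/input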

check if folder exists

hdfs dfs -test -d /user/root/some_folder
echo $?

0 - exists, 1 - does not exist
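The exit code can be used directly in a shell conditional; a minimal sketch using the same path:

if hdfs dfs -test -d /user/root/some_folder; then
  echo "folder exists"
else
  echo "folder is missing"
fi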

checksum ( note: HDFS returns an MD5-of-MD5-of-CRC32C value, not a plain md5sum )

hdfs dfs -checksum <path to file>

hdfs logic emulator

java -jar HadoopChecksumForLocalfile-1.0.jar V717777_MDF4_20190201.MF4 0 512 CRC32C

a real md5sum can only be computed locally, by streaming the file content through the client

hdfs dfs -cat <path to file> | md5sum

find folders ( for cloudera only !!! )

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.14.4-job.jar org.apache.solr.hadoop.HdfsFindTool -find hdfs:///data/ingest/ -type d -name "some-name-of-the-directory"

find files ( for cloudera only !!! )

hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-1.0.0-cdh5.14.4-job.jar org.apache.solr.hadoop.HdfsFindTool -find hdfs:///data/ingest/ -type f -name "some-name-of-the-file"

change factor of replication

hdfs dfs -setrep -w 4 /data/file.txt

create folder

hdfs dfs -mkdir /data 

copy files from local filesystem to remote

hdfs dfs -put /home/root/tmp/Iris.csv /data/
hdfs dfs -copyFromLocal /home/root/tmp/Iris.csv /data/

copy files from local filesystem to remote with replication factor

hdfs dfs -Ddfs.replication=2 -put /path/to/local/file /path/to/hdfs

copy within HDFS ( small files only !!! ) - data is read from DataNodes and written back to DataNodes through the client (!!!)

hdfs dfs -cp /home/root/tmp/Iris.csv /data/

remote copy ( does not use the client as a pipe )

hadoop distcp /home/root/tmp/Iris.csv /data/

read data from DataNode

hdfs dfs -get /path/to/hdfs /path/to/local/file
hdfs dfs -copyToLocal /path/to/hdfs /path/to/local/file

remove data from HDFS ( moved to the Trash, which is separate for each user )

hdfs dfs -rm -r /path/to/hdfs-folder

remove data from HDFS

hdfs dfs -rm -r -skipTrash /path/to/hdfs-folder

clean up trash bin

hdfs dfs -expunge

file info ( disk usage )

hdfs dfs -du -h /path/to/hdfs-folder

does a file/folder exist?

hdfs dfs -test -e /path/to/hdfs-folder

list of files ( / - root )

hdfs dfs -ls /
hdfs dfs -ls hdfs://192.168.1.10:8020/path/to/folder

the same as the previous command, but relying on fs.defaultFS = hdfs://192.168.1.10:8020

hdfs dfs -ls /path/to/folder
hdfs dfs -ls file:///local/path   ==   (ls /local/path)

show all sub-folders ( recursive listing )

hdfs dfs -ls -R /path/to/folder

standard commands for HDFS

-touchz, -cat (-text), -tail, -mkdir, -chmod, -chown, -count ....
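A few usage sketches for these commands ( paths reuse examples from above; permissions and owner are illustrative ):

hdfs dfs -touchz /data/_SUCCESS
hdfs dfs -cat /data/Iris.csv | head
hdfs dfs -tail /data/Iris.csv
hdfs dfs -chmod 750 /data
hdfs dfs -chown root:supergroup /data
hdfs dfs -count /data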

java application run, java run, java build

hadoop classpath
hadoop classpath --glob
javac -classpath `hadoop classpath` MyProducer.java
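To run the compiled class, the same classpath can be reused ( mirrors the MapR variant later in this document ):

java -classpath `hadoop classpath`:. MyProducer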

Hadoop governance, administration

filesystem capacity, disk usage in human readable format

hdfs dfs -df -h

file system check, reporting, file system information

hdfs fsck /
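Useful fsck options for per-file block details ( the path is a placeholder ):

hdfs fsck /path/to/hdfs-folder -files -blocks -locations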

balancer for the distributed file system, needed after some DataNode(s) fail or are removed

hdfs balancer
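The utilization threshold ( percent of allowed disk-usage deviation between DataNodes ) can be set explicitly; the value 5 is illustrative:

hdfs balancer -threshold 5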

administration of the filesystem

hdfs dfsadmin -help

show statistic

hdfs dfsadmin -report

HDFS to "read-only" mode for external users

hdfs dfsadmin -safemode
hdfs dfsadmin -upgrade
hdfs dfsadmin -backup
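Safe mode ( read-only metadata state ) can be queried and toggled explicitly, for example:

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave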

Security

  • File permissions ( posix attributes )
  • Hive ( grant revoke )
  • Knox ( REST API for hadoop )
  • Ranger

job execution

hadoop jar {path to jar} {classname}
yarn jar {path to jar} {classname}

application list on YARN

yarn application -list

application list with ALL states

yarn application -list -appStates ALL

application status

yarn application -status application_1555573258694_20981

application kill on YARN

yarn application -kill application_1540813402987_3657

application log on YARN

yarn logs -applicationId application_1540813402987_3657 | less

application log on YARN by user

yarn logs -applicationId application_1540813402987_3657 -appOwner my_tech_user | less

Hortonworks sandbox

tutorials ecosystem sandbox tutorial download install instruction getting started

Web SSH

localhost:4200
root/hadoop

SSH access

ssh root@localhost -p 2222

setup after installation, init, ambari password reset

  • shell web client (aka shell-in-a-box): localhost:4200 root / hadoop
  • ambari-admin-password-reset
  • ambari-agent restart
  • login into ambari: localhost:8080 admin/{your password}

Zeppelin UI

http://localhost:9995 user: maria_dev pass: maria_dev

install jupyter for spark

https://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/

SPARK_MAJOR_VERSION is set to 2, using Spark2
Error in pyspark startup:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.

To work around this, set the variable to use Spark 1 inside the script: SPARK_MAJOR_VERSION=1

Sqoop ( SQl to/from hadOOP )

The JDBC driver matching the JDBC URL must be present in $SQOOP_HOME/lib
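For the MySQL examples below this means dropping the MySQL connector jar into that directory ( the jar location is illustrative ):

cp /path/to/mysql-connector-java-*.jar $SQOOP_HOME/lib/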

import

Import destinations:

  • text files
  • binary files
  • HBase
  • Hive

sqoop import --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --table customers --target-dir /crm/users/michael.csv

additional parameter to control how the table is split among mappers working in parallel:

--split-by customer_id_pk
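A sketch combining the split column with an explicit mapper count ( --num-mappers / -m is a standard Sqoop option; the value 4 and the reuse of the target directory above are illustrative ):

sqoop import --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --table customers --target-dir /crm/users/michael.csv --split-by customer_id_pk --num-mappers 4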

additional parameters:

--fields-terminated-by ','
--columns "name, age, address"
--where "age>30"
--query "select name, age, address from customers where age>30"

additional import parameters:

--as-textfile
--as-sequencefile
--as-avrodatafile

export

export modes:

  • insert

sqoop export --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --export-dir /crm/users/michael.csv --table customers

  • update

sqoop export --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --export-dir /crm/users/michael.csv --table customers --update-key user_id

  • call ( a stored procedure will be executed )

sqoop export --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --export-dir /crm/users/michael.csv --call customer_load

additional export parameters:

# rows per single INSERT statement
-Dsqoop.export.records.per.statement
# number of INSERT statements per transaction ( before commit )
-Dsqoop.export.statements.per.transaction
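A sketch showing where these generic -D options go ( they must precede the tool-specific arguments; the values 100 and 10 are illustrative ):

sqoop export -Dsqoop.export.records.per.statement=100 -Dsqoop.export.statements.per.transaction=10 --connect jdbc:mysql://127.0.0.1/crm --username myuser --password mypassword --table customers --export-dir /crm/users/michael.csv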

java application run, java run, java build

mapr classpath
mapr classpath glob

compile java app, execute java app

javac -classpath `mapr classpath` MyProducer.java
java -classpath `mapr classpath`:. MyProducer

MDF4 reading

import mdfreader

# read the file header / metadata
header = mdfreader.mdfinfo4.Info4("file.MF4")
header.keys()
header['AT'].keys()
header['AT'][768]['at_embedded_data']

# list channels without loading the data
info = mdfreader.mdfinfo()
info.listChannels("file.MF4")

# alternative: asammdf
from asammdf import MDF4 as MDF
mdf = MDF("file.MF4")

HCatalog

documentation

table description

hcat -e "describe school_explorer"
hcat -e "describe formatted school_explorer"

SQL engines

  • Impala
  • Phoenix ( HBase )
  • Drill ( schema-less sql )
  • BigSQL ( PostgreSQL + Hadoop )
  • Spark

workflow scheduler

START -> ACTION -> OK | ERROR

Cascading

TBD

Scalding

TBD


Hadoop streaming

  • Storm ( real time streaming solution )
  • Spark ( near real time streaming, uses microbatching )
  • Samza ( streaming on top of Kafka )
  • Flink ( common approach to batch and stream code development )

Data storage, NoSQL

Accumulo

TBD

Druid

TBD

Cluster management

  • cloudera manager

TODO

download all slides from stepik - for review and for creating xournal notes