This is it: a multi-container Docker environment with Hadoop (HDFS) and Spark, but without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop with WSL 2 it seems to consume a mere 3 GB.)
To deploy the HDFS-Spark cluster, run:
docker-compose up
docker-compose creates a docker network that can be found by running docker network list, e.g. docker-hadoop-spark.
Run docker network inspect on that network (e.g. docker-hadoop-spark) to find the IP address the Hadoop interfaces are published on. Access these interfaces with the following URLs:
- Namenode: http://<dockerhadoop_IP_address>:9870/dfshealth.html#tab-overview
- History server: http://<dockerhadoop_IP_address>:8188/applicationhistory
- Datanode: http://<dockerhadoop_IP_address>:9864/
- Nodemanager: http://<dockerhadoop_IP_address>:8042/node
- Resource manager: http://<dockerhadoop_IP_address>:8088/
- Spark master: http://<dockerhadoop_IP_address>:8080/
- Spark worker 1: http://<dockerhadoop_IP_address>:8081/
- Spark worker 2: http://<dockerhadoop_IP_address>:8082/
- Spark worker 3: http://<dockerhadoop_IP_address>:8083/
Since Docker Desktop turns “Expose daemon on tcp://localhost:2375 without TLS” off by default, there can be all kinds of connection problems when running the complete docker-compose setup. Turning this option back on (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking for a more secure solution to this.
Copy breweries.csv to the namenode.
docker cp breweries.csv namenode:breweries.csv
Open a bash shell on the namenode container.
docker exec -it namenode bash
Create an HDFS directory /data/openbeer/breweries.
hdfs dfs -mkdir -p /data/openbeer/breweries
Copy breweries.csv to HDFS:
hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.
Go to the command line of the Spark master and start PySpark.
docker exec -it spark-master bash
/spark/bin/pyspark --master spark://spark-master:7077
Load breweries.csv from HDFS.
brewfile = spark.read.csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
brewfile.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
|null| name| city|state| id|
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
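Note that the header row is loaded as data and the columns get generic names (_c0 to _c4). As an optional extra, here is a small sketch using standard Spark CSV options (the column names are assumed from the output above) that re-reads the file with the header option and runs a quick aggregation:

# Optional: treat the first row as a header so the real column names
# (name, city, state, id) are used instead of _c0.._c4.
brewfile = spark.read.option("header", "true").csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")

# Quick check: number of breweries per state.
brewfile.groupBy("state").count().show(5)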
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.
Go to the command line of the Spark master and start spark-shell.
docker exec -it spark-master bash
/spark/bin/spark-shell --master spark://spark-master:7077
Load breweries.csv from HDFS.
val df = spark.read.csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
df.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
|null| name| city|state| id|
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
How cool is that? Your own Spark cluster to play with.
The configuration parameters can be specified in the hadoop.env file or as environment variables for specific services (e.g. namenode, datanode, etc.):
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF corresponds to core-site.xml. fs_defaultFS=hdfs://namenode:8020 will be transformed into:
<property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>
To define a dash inside a configuration parameter name, use a triple underscore, such as YARN_CONF_yarn_log___aggregation___enable=true (yarn-site.xml):
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
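To illustrate the naming rule, here is a simplified Python sketch of the convention described above (not the actual entrypoint.sh logic):

def conf_suffix_to_property_name(suffix):
    # Simplified illustration: a triple underscore becomes a dash,
    # the remaining underscores become dots.
    return suffix.replace("___", "-").replace("_", ".")

print(conf_suffix_to_property_name("fs_defaultFS"))                     # fs.defaultFS
print(conf_suffix_to_property_name("yarn_log___aggregation___enable"))  # yarn.log-aggregation-enable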
The available configurations are:
- /etc/hadoop/core-site.xml CORE_CONF
- /etc/hadoop/hdfs-site.xml HDFS_CONF
- /etc/hadoop/yarn-site.xml YARN_CONF
- /etc/hadoop/httpfs-site.xml HTTPFS_CONF
- /etc/hadoop/kms-site.xml KMS_CONF
- /etc/hadoop/mapred-site.xml MAPRED_CONF
If you need to extend some other configuration file, refer to the base/entrypoint.sh bash script.