This is it: a multi-container Docker environment with Hadoop (HDFS) and Spark, but without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop with WSL 2 it seems to consume a mere 3 GB.)
To deploy the HDFS-Spark cluster, run:
docker-compose up
docker-compose creates a docker network that can be found by running docker network list, e.g. docker-hadoop-spark.
Run docker network inspect on that network (e.g. docker-hadoop-spark) to find the IP address the Hadoop interfaces are published on. Access these interfaces with the following URLs:
- Namenode: http://<dockerhadoop_IP_address>:9870/dfshealth.html#tab-overview
- History server: http://<dockerhadoop_IP_address>:8188/applicationhistory
- Datanode: http://<dockerhadoop_IP_address>:9864/
- Nodemanager: http://<dockerhadoop_IP_address>:8042/node
- Resource manager: http://<dockerhadoop_IP_address>:8088/
- Spark master: http://<dockerhadoop_IP_address>:8080/
- Spark worker 1: http://<dockerhadoop_IP_address>:8081/
- Spark worker 2: http://<dockerhadoop_IP_address>:8082/
- Spark worker 3: http://<dockerhadoop_IP_address>:8083/
Since Docker Desktop turns “Expose daemon on tcp://localhost:2375 without TLS” off by default, there can be all kinds of connection problems when running the complete docker-compose setup. Turning this option back on (Settings > General > Expose daemon on tcp://localhost:2375 without TLS) makes it all work. I’m still looking for a more secure solution to this.
Copy breweries.csv to the namenode.
docker cp breweries.csv namenode:breweries.csv
Open a bash shell on the namenode container.
docker exec -it namenode bash
Create an HDFS directory /data/openbeer/breweries.
hdfs dfs -mkdir -p /data/openbeer/breweries
Copy breweries.csv to HDFS:
hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.
Go to the command line of the Spark master and start PySpark.
docker exec -it spark-master bash
/spark/bin/pyspark --master spark://spark-master:7077
Load breweries.csv from HDFS.
brewfile = spark.read.csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
brewfile.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
|null| name| city|state| id|
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
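Note that the header row is loaded as data and the columns get generic names (_c0 to _c4). As an optional extra, here is a small sketch using standard Spark CSV options (the column names are assumed from the output above) that re-reads the file with the header option and runs a quick aggregation:

# Optional: treat the first row as a header so the real column names
# (name, city, state, id) are used instead of _c0.._c4.
brewfile = spark.read.option("header", "true").csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")

# Quick check: number of breweries per state.
brewfile.groupBy("state").count().show(5)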
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.
Go to the command line of the Spark master and start spark-shell.
docker exec -it spark-master bash
/spark/bin/spark-shell --master spark://spark-master:7077
Load breweries.csv from HDFS.
val df = spark.read.csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
df.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
|null| name| city|state| id|
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
How cool is that? Your own Spark cluster to play with.
The configuration parameters can be specified in the hadoop.env file or as environment variables for specific services (e.g. namenode, datanode, etc.):
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF corresponds to core-site.xml. fs_defaultFS=hdfs://namenode:8020 will be transformed into:
<property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>
To define a dash inside a configuration parameter name, use a triple underscore, such as YARN_CONF_yarn_log___aggregation___enable=true (yarn-site.xml):
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
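To illustrate the naming rule, here is a simplified Python sketch of the convention described above (not the actual entrypoint.sh logic):

def conf_suffix_to_property_name(suffix):
    # Simplified illustration: a triple underscore becomes a dash,
    # the remaining underscores become dots.
    return suffix.replace("___", "-").replace("_", ".")

print(conf_suffix_to_property_name("fs_defaultFS"))                     # fs.defaultFS
print(conf_suffix_to_property_name("yarn_log___aggregation___enable"))  # yarn.log-aggregation-enable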
The available configurations are:
- /etc/hadoop/core-site.xml CORE_CONF
- /etc/hadoop/hdfs-site.xml HDFS_CONF
- /etc/hadoop/yarn-site.xml YARN_CONF
- /etc/hadoop/httpfs-site.xml HTTPFS_CONF
- /etc/hadoop/kms-site.xml KMS_CONF
- /etc/hadoop/mapred-site.xml MAPRED_CONF
If you need to extend some other configuration file, refer to the base/entrypoint.sh bash script.