Demo of PySpark and Jupyter Notebooks with the Jupyter Docker Stacks. Complete information for this project can be found in the related blog post, *Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker*.
- Clone this project from GitHub:

  ```bash
  git clone \
    --branch v2 --single-branch --depth 1 --no-tags \
    https://github.com/garystafford/pyspark-setup-demo.git
  ```
- Create the `$HOME/data/postgres` directory for PostgreSQL files:

  ```bash
  mkdir -p ~/data/postgres
  ```
- Optional: for local development, install the Python packages:

  ```bash
  python3 -m pip install -r requirements.txt
  ```
- Optional: pull the Docker images first:

  ```bash
  docker pull jupyter/all-spark-notebook:latest
  docker pull postgres:12-alpine
  docker pull adminer:latest
  ```
- Deploy the Docker stack:

  ```bash
  docker stack deploy -c stack.yml jupyter
  ```
- Retrieve the token required to log into Jupyter:

  ```bash
  docker logs $(docker ps | grep jupyter_spark | awk '{print $NF}')
  ```
- From the Jupyter terminal, run the install script:

  ```bash
  sh bootstrap_jupyter.sh
  ```
- Export your Plotly username and API key to the `.env` file (a sketch of how the notebooks might pick up these values follows below):

  ```bash
  echo "PLOTLY_USERNAME=your-username" >> .env
  echo "PLOTLY_API_KEY=your-api-key" >> .env
  ```
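As a rough illustration only, the snippet below shows one way a notebook or script could pick up those credentials. The use of the `python-dotenv` package and Plotly's `chart_studio` module are assumptions made for this sketch; the project's actual notebooks may load and register the credentials differently.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed
import chart_studio             # assumption: Plotly's chart_studio is installed

# Read PLOTLY_USERNAME and PLOTLY_API_KEY from the .env file into the environment
load_dotenv()

# Register the credentials so charts can be published to Plotly from the notebook
chart_studio.tools.set_credentials_file(
    username=os.environ["PLOTLY_USERNAME"],
    api_key=os.environ["PLOTLY_API_KEY"],
)
```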
From a Jupyter terminal window:

- Sample Python script: run `python3 01_simple_script.py` from the Jupyter terminal
- Sample PySpark job: run `$SPARK_HOME/bin/spark-submit 02_pyspark_job.py` from the Jupyter terminal (see the first sketch after this list)
- Load the PostgreSQL sample data: run `python3 03_load_sql.py` from the Jupyter terminal (see the second sketch after this list)
- Sample Jupyter Notebook: open `04_notebook.ipynb` from the Jupyter Console (see the third sketch after this list)
- Sample Jupyter Notebook: open `05_notebook.ipynb` from the Jupyter Console
- Try the alternate Jupyter stack with nbextensions pre-installed: first `cd docker_nbextensions/`, then run `docker build -t garystafford/all-spark-notebook-nbext:latest .` to build the new image
- Then, delete the previous stack with `docker stack rm jupyter`; to create the new stack, run `cd -` followed by `docker stack deploy -c stack-nbext.yml jupyter`
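For orientation, here is a minimal sketch of the general shape of a PySpark job that `spark-submit` can run. The application name, input path, and logic are placeholders, not the actual contents of `02_pyspark_job.py`:

```python
# Minimal PySpark job sketch (placeholders only; not the actual 02_pyspark_job.py)
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this application
spark = SparkSession.builder \
    .appName("pyspark_demo_app") \
    .getOrCreate()

# Load a CSV file into a DataFrame, inferring column types (placeholder path)
df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)

df.printSchema()
df.show(10)

spark.stop()
```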
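Similarly, a script that loads sample data into the stack's PostgreSQL service might look roughly like the following. The `psycopg2` driver, connection values, and SQL file name are all assumptions for this sketch; `03_load_sql.py` may work differently:

```python
# Hypothetical sketch of loading sample data into PostgreSQL (not the actual 03_load_sql.py)
import psycopg2  # assumption: psycopg2 is the driver used

# Connection parameters for the stack's postgres service (placeholder values)
conn = psycopg2.connect(
    host="postgres", port=5432,
    dbname="demo", user="postgres", password="postgres1234"
)

# The connection context manager commits the transaction on success
with conn, conn.cursor() as cur:
    # Run a SQL file containing DDL and sample INSERT statements (placeholder file name)
    with open("sample_data.sql") as sql_file:
        cur.execute(sql_file.read())

conn.close()
```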
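Finally, one common pattern for combining the two services, sketched below with placeholder names only, is reading a PostgreSQL table into a Spark DataFrame over JDBC; whether and how the sample notebooks do this may differ:

```python
# Hypothetical sketch: read a PostgreSQL table into a Spark DataFrame over JDBC
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_notebook_demo").getOrCreate()

# All connection values below are placeholders for illustration
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/demo")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "sample_table")
    .option("user", "postgres")
    .option("password", "postgres1234")
    .load()
)

df.show(10)
```

Note this assumes the PostgreSQL JDBC driver is available on the Spark classpath, which the bootstrap script may or may not provide.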