Minimal example of a local Python setup with Spark and Delta Lake that allows unit-testing Spark/Delta code via pytest.
The setup is inspired by dbx by Databricks.
To include Delta in the Spark session created by pytest, the `spark` fixture in `./tests/conftest.py` runs `configure_spark_with_delta_pip` and adds the following settings to the Spark config:
| key | value |
| --- | --- |
| spark.sql.extensions | io.delta.sql.DeltaSparkSessionExtension |
| spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog |
See https://docs.delta.io/3.2.0/quick-start.html#python for more info.
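A minimal sketch of what such a fixture can look like, following the Delta quick-start pattern (the app name and `local[*]` master are illustrative assumptions, not necessarily the values used in `./tests/conftest.py`):

```python
# tests/conftest.py — sketch of a session-scoped Spark fixture with Delta enabled.
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    builder = (
        SparkSession.builder.appName("pytest-delta")  # app name is illustrative
        .master("local[*]")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
    )
    # configure_spark_with_delta_pip makes the Delta jars available to the session.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    yield spark
    spark.stop()
```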
Requirements:
- Python >= 3.10
- Java 8, 11 or 17 for Spark (https://spark.apache.org/docs/3.5.1/#downloading); `JAVA_HOME` must be set
The following commands create and activate a virtual environment.
- The `[dev]` extra also installs development tools.
- The `--editable` flag makes the CLI script available.
Commands:
- Makefile:
  - `make requirements`
  - `source .venv/bin/activate`
- Windows:
  - `python -m venv .venv`
  - `.venv\Scripts\activate`
  - `python -m pip install --upgrade uv`
  - `uv pip install --editable .[dev]`
To lock dependencies from `pyproject.toml` into `requirements.txt` files:
- Without dev dependencies: `uv pip compile pyproject.toml -o requirements.txt`
- With dev dependencies: `uv pip compile pyproject.toml --extra dev -o requirements-dev.txt`
- We use `uv pip install` instead of `uv pip sync` to also get an editable install.
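For reference, a hedged sketch of the `pyproject.toml` layout these commands assume; the package name, dependency versions, and entry point are illustrative, not the actual project values:

```toml
# pyproject.toml — illustrative sketch; names and versions are assumptions.
[project]
name = "spark-delta-example"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "pyspark>=3.5",
    "delta-spark>=3.2",
]

[project.optional-dependencies]
# installed via `uv pip install --editable .[dev]`
dev = [
    "pytest",
]

[project.scripts]
# the CLI script made available by the editable install
spark-delta-example = "spark_delta_example.cli:main"
```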
I recommend using WSL instead of running on Windows directly, as even with the additional Hadoop libraries Spark with Delta occasionally simply freezes.
To run this on Windows you need additional Hadoop libraries; see https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems: "In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable."
- Download the `bin` directory from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin (required files: `hadoop.dll` and `winutils.exe`)
- Set the environment variable `HADOOP_HOME` to the directory above the `bin` directory
To run the tests:
- Makefile: `make test`
- Windows: `pytest -vv`
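As a usage sketch, a unit test against the `spark` fixture might look like the following; the file name and sample data are illustrative:

```python
# tests/test_delta_roundtrip.py — sketch of a Delta round-trip test.
def test_delta_roundtrip(spark, tmp_path):
    # Write a small DataFrame as a Delta table into pytest's temp directory.
    path = str(tmp_path / "events")
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").save(path)

    # Read it back and check the contents survived the round trip.
    result = spark.read.format("delta").load(path)
    assert sorted(row.id for row in result.collect()) == [1, 2]
```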