
Local env with spark+delta

Minimal example of a local Python setup with Spark and Delta Lake that allows unit-testing Spark/Delta code via pytest.

The setup is inspired by dbx by Databricks.

Delta

To include Delta in the Spark session created by pytest, the spark fixture in ./tests/conftest.py runs configure_spark_with_delta_pip and adds the following settings to the Spark config:

  • spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
  • spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog

See https://docs.delta.io/3.2.0/quick-start.html#python for more info.

Development

Requirements:

Setup Virtual environment

The following commands create and activate a virtual environment.

  • The [dev] extra also installs development tools.
  • The --editable flag makes the CLI script available.

Commands:

  • Makefile:
    make requirements
    source .venv/bin/activate
  • Windows:
    python -m venv .venv
    .venv\Scripts\activate
    python -m pip install --upgrade uv
    uv pip install --editable .[dev]

Updating locked dependencies

To lock dependencies from pyproject.toml into requirements.txt files:

  • Without dev dependencies:

    uv pip compile pyproject.toml -o requirements.txt
    
  • With dev dependencies:

    uv pip compile pyproject.toml --extra dev -o requirements-dev.txt
    
  • We use uv pip install instead of uv pip sync to also allow an editable install.

Windows

I recommend using WSL instead, as even with the additional Hadoop libraries Spark/Delta occasionally simply freezes on Windows.

To run this natively on Windows you need additional Hadoop libraries, see https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems.

"In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable."

  1. Download the bin directory https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin (required files: hadoop.dll and winutils.exe)
  2. Set the environment variable HADOOP_HOME to the directory containing that bin directory (so that %HADOOP_HOME%\bin\winutils.exe exists)

Run tests

  • Makefile:
    make test
  • Windows:
    pytest -vv
