You need the Rust toolchain (both stable and nightly) to build the project. You can use rustup to manage the Rust toolchain in your local environment.
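For example, with rustup you can install both toolchains and the components used below (a suggested setup; adjust to your workflow):

```bash
# Install the stable and nightly toolchains.
rustup toolchain install stable nightly
# Clippy is used for linting on stable; rustfmt on nightly is used for formatting.
rustup component add clippy
rustup component add rustfmt --toolchain nightly
```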
You also need the following tools when working on the project.
- The Protocol Buffers compiler (`protoc`).
- Hatch.
- Maturin.
On macOS, you can install these tools via Homebrew.
```bash
brew install protobuf hatch maturin
```
Run the following commands to verify the code before committing changes.
```bash
cargo +nightly fmt && cargo clippy --all-targets --all-features && cargo build && cargo test
```
The code can be built and tested using the stable toolchain, while the nightly toolchain is required for formatting the code.
Please make sure there are no warnings in the output.
The GitHub Actions workflow runs `cargo clippy` with the `-D warnings` option, so the build will fail if there are any warnings from either the compiler or the linter.
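To reproduce the CI check locally, you can pass the same flag to Clippy yourself (this exact invocation is a suggestion, not a project script):

```bash
# Fail the lint step on any compiler or Clippy warning, matching the CI behavior.
cargo clippy --all-targets --all-features -- -D warnings
```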
Run the following command to build the Python library using Maturin. The command builds the package inside the default Hatch environment.
```bash
hatch run maturin build
```
If you want to build and install the Python library for local development, run the following command.
```bash
hatch run maturin develop
```
The command installs the source code as an editable package in the Hatch environment, while the built `.so` native library is stored in the source directory. You can then use `hatch shell` to enter the Python environment and test the library. Any changes to the Python code are reflected in the environment immediately, but if you make changes to the Rust code, you need to run the `develop` command again.
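For example, you can check that the freshly built native module is importable in the default environment (the module name `framework` below is a placeholder; use the project's actual package name):

```bash
# Run inside the default Hatch environment; replace "framework" with the real package name.
hatch run python -c "import framework"
```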
If a test fails due to mismatched gold data, use the following command to update the gold data and commit the changes.
```bash
env FRAMEWORK_UPDATE_GOLD_DATA=1 cargo test
```
Please install OpenJDK 17 on your host. You can use any widely-used OpenJDK distribution, such as Amazon Corretto.
It is recommended to set `JAVA_HOME` when following the instructions in the next sections. If the `JAVA_HOME` environment variable is not set, the Spark build script will try to find the Java installation using either (1) the location of `javac` (for Linux), or (2) the output of `/usr/libexec/java_home` (for macOS).
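For example, you can set `JAVA_HOME` as follows (the paths are illustrative and depend on your JDK distribution and installation location):

```bash
# macOS: resolve the JDK 17 home via the system helper.
export JAVA_HOME="$(/usr/libexec/java_home -v 17)"

# Linux: example for Amazon Corretto 17; adjust the path for your distribution.
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
```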
Run the following command to clone the Spark project.
```bash
git clone git@github.com:apache/spark.git opt/spark
```
Run the following command to build the Spark project. The command creates a patched PySpark package containing Python code along with the JAR files. Python tests are also included in the patched package.
```bash
scripts/spark-tests/build-pyspark.sh
```
Here are some notes about the `build-pyspark.sh` script.

- The script will fail with an error if the Spark directory is not clean. The script internally applies a patch to the repository, and the patch is reverted before the script exits (either successfully or with an error).
- The script can work with an arbitrary Python 3 installation, since the `setup.py` script in the Spark project only uses the Python standard library.
- The script takes a while to run. On GitHub Actions, it takes about 40 minutes on the default GitHub-hosted runners. Fortunately, you only need to run this script once, unless there is a change in the Spark patch file. The patch file is in the `scripts/spark-tests` directory.
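Since the script requires a clean Spark checkout, it can help to confirm there are no local modifications before running it (a minimal check; not part of the project's scripts):

```bash
# An empty output means the opt/spark checkout is clean.
git -C opt/spark status --porcelain
```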
We use Hatch to manage Python environments. The environments are defined in the `pyproject.toml` file. When you run Hatch commands, environments are created in `.venvs/` in the project root directory. You can also run `hatch env create` to create the default environment explicitly, and then configure your IDE to use this environment (`.venvs/default`) for Python development.
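For example (the `ls` check is only for illustration):

```bash
# Create the default Hatch environment explicitly.
hatch env create
# The interpreter to configure in your IDE is then located here.
ls .venvs/default/bin/python
```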
- Some environments depend on the patched PySpark package created in the previous section. The patched PySpark package will be installed automatically during the environment creation.
- For this project, all Hatch environments are configured to use `pip` as the package installer for local development, so pip environment variables such as `PIP_INDEX_URL` still work. However, it is recommended to also set `uv` environment variables such as `UV_INDEX_URL`, since Hatch uses `uv` as the package installer for internal environments (e.g. when doing static analysis via `hatch fmt`).
- Hatch will download prebuilt Python interpreters when the specified Python version for an environment is not installed on your host. Note that the prebuilt Python interpreters only track the minor version of Python. If downloading prebuilt Python interpreters fails (e.g. due to network issues), or if you want precise control over the patch version of Python being used in Hatch environments, you can install Python manually so that Hatch can pick up the Python installations. For example, you can use pyenv to install multiple Python versions and run the following command in the project root directory.

  ```bash
  pyenv local 3.8.19 3.9.19 3.10.14 3.11.9
  ```

  The above command creates a `.python-version` file (ignored by Git) in the project root directory, so that multiple Python versions are available on `PATH` due to the pyenv shim. These Python versions are then available to Hatch for the environment creation.
Use the following command to build and run the Spark Connect server powered by the framework.

```bash
scripts/spark-tests/run-server.sh
```
Before running Spark tests, please create the `test` Hatch environment using the following commands. Note that you do not need to run `maturin develop` in the `test` environment again after you make code changes. We only use the pytest plugins (pure Python code) from the project, which do not need to be rebuilt by Maturin.

```bash
hatch env create test
hatch run test:install-pyspark
hatch run test:maturin develop
```
After running the Spark Connect server, start another terminal and use the following command to run the Spark tests. The test logs will be written to `tmp/spark-tests/<name>`, where `<name>` is defined by the `TEST_RUN_NAME` environment variable whose default value is `latest`.

```bash
scripts/spark-tests/run-tests.sh
```
The above command runs a default set of test suites for Spark Connect. Each test suite will write its `<suite>.jsonl` and `<suite>.log` files to the log directory, where `<suite>` is the test suite name.

You can pass arguments to the script, which will be forwarded to `pytest`. You can also use `PYTEST_` environment variables to customize the test execution. For example, `PYTEST_ADDOPTS="-k <expression>"` can be used to run specific tests matching `<expression>`.
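As a concrete illustration of the environment variable approach (the test expression is illustrative):

```bash
# Run only the tests whose names match the given expression.
env PYTEST_ADDOPTS="-k test_sql" scripts/spark-tests/run-tests.sh
```

Alternatively, you can forward the arguments to `pytest` directly, as in the following example.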
```bash
# Write the test logs to a different directory (`tmp/spark-tests/selected`).
export TEST_RUN_NAME=selected

scripts/spark-tests/run-tests.sh --pyargs pyspark.sql.tests.connect -v -k test_sql
```
When you customize the test execution using the above command, a single test suite will be run, and the test log files are always `test.jsonl` and `test.log` in the log directory.
Here are some useful commands to analyze Spark test logs.
You can replace `test.jsonl` with a different log file name if you are analyzing a different test suite.
(1) Get the error counts for failed tests.
```bash
# You can remove the `--slurpfile baseline tmp/spark-tests/baseline/test.jsonl` arguments
# if you do not have baseline test logs.
jq -r -f scripts/spark-tests/count-errors.jq \
  --slurpfile baseline tmp/spark-tests/baseline/test.jsonl \
  tmp/spark-tests/latest/test.jsonl | less
```
(2) Show a sorted list of passed tests.
```bash
jq -r -f scripts/spark-tests/show-passed-tests.jq \
  tmp/spark-tests/latest/test.jsonl | less
```
(3) Show the differences of passed tests between two runs.
```bash
diff -U 0 \
  <(jq -r -f scripts/spark-tests/show-passed-tests.jq tmp/spark-tests/baseline/test.jsonl) \
  <(jq -r -f scripts/spark-tests/show-passed-tests.jq tmp/spark-tests/latest/test.jsonl)
```
You can use the following commands to start a local PySpark session.
```bash
# Run the PySpark shell using the original Java implementation.
hatch run pyspark

# Run the PySpark shell using the Spark Connect implementation.
# You can ignore the "sparkContext() is not implemented" error when the shell starts.
env SPARK_REMOTE="sc://localhost:50051" hatch run pyspark
```
The Spark tests are triggered in GitHub Actions for pull requests, either when the pull request is opened or when the commit message contains `[spark tests]` (case-insensitive). The Spark tests are always run when the pull request is merged into the `main` branch.
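For example, a commit message like the following would trigger the Spark tests on a pull request (the message text is illustrative):

```bash
# The "[spark tests]" marker (case-insensitive) triggers the Spark tests for the pull request.
git commit -m "Improve aggregation support [spark tests]"
```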
Since we use PyO3 to support the Python binding in Rust, we need some additional setup to run the Rust debugger in RustRover. In Run > Edit Configurations, add a new Cargo configuration with the following settings:

- Name: `Run Spark Connect server` (You can use any name you like.)
- Command: `run -p framework-spark-connect`
- Environment variables:
  - (required) `PYTHONPATH`: `.venvs/default/lib/python<version>/site-packages` (Please replace `<version>` with the actual Python version, e.g. `3.11`.)
  - (required) `PYO3_PYTHON`: `<project>/.venvs/default/bin/python` (Please replace `<project>` with the actual project path. This must be an absolute path.)
  - (required) `RUST_MIN_STACK`: `8388608`
  - (optional) `RUST_BACKTRACE`: `full`
  - (optional) `RUST_LOG`: `framework_spark_connect=debug`

When entering environment variables, you can click the button on the right side of the input box to open the dialog and add the environment variables one by one. You can leave the other settings as default.
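For reference, the same configuration can be approximated as a plain shell invocation outside the IDE (a sketch only; the Python version `3.11` and the optional variables are illustrative):

```bash
# Run the Spark Connect server with the same environment as the RustRover configuration.
env PYTHONPATH=".venvs/default/lib/python3.11/site-packages" \
    PYO3_PYTHON="$PWD/.venvs/default/bin/python" \
    RUST_MIN_STACK=8388608 \
    RUST_BACKTRACE=full \
    RUST_LOG=framework_spark_connect=debug \
    cargo run -p framework-spark-connect
```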
The PyO3 package will be rebuilt when the Python interpreter changes.
This will cause all downstream packages to be rebuilt, resulting in a long build time for development.
The issue gets more complicated when you use both command line tools and IDEs, which share the same Cargo build cache.
(For example, RustRover may run `cargo check` in the background.)
To reduce the build time, you need to make sure the Python interpreter used by PyO3 is configured in the same way across environments. Please consider the following items.
- Please always invoke Maturin via Hatch (e.g. `hatch run maturin develop` and `hatch run maturin build`). In this way, Maturin internally sets the `PYO3_PYTHON` environment variable to the absolute path of the Python interpreter of the project's default Hatch environment.
- The `scripts/spark-tests/run-server.sh` script internally sets the `PYO3_PYTHON` environment variable to the same value as above.
- The RustRover debugger configuration in the previous section sets the `PYO3_PYTHON` environment variable to the same value as above.
- For RustRover, in "Preferences" > "Rust" > "External Linters", set the `PYO3_PYTHON` environment variable to the same value as above.
- If you need to run Cargo commands such as `cargo build`, set the `PYO3_PYTHON` environment variable in the terminal session to the same value as above.

  ```bash
  # Run the following command in the project root directory.
  export PYO3_PYTHON="$(hatch env find)/bin/python"
  ```
Note that the `maturin` command and the `cargo` command enable different features for PyO3 (e.g. `extension-module`), so if you alternate between the two build tools, the PyO3 library will still be rebuilt.
If you run `hatch build`, it uses Maturin as the build system, and the build happens in an isolated Python environment. So the build does not interfere with the Cargo build cache in `target/`. However, it also means that a fresh build is performed every time, which can be slow. Therefore, it is not recommended to use `hatch build` for local development.
Occasionally, you may need to patch the Spark source code further. Here are some commands that can be helpful for this purpose.
```bash
# Apply the patch.
# You can now modify the Spark source code.
git -C opt/spark apply ../../scripts/spark-tests/spark-3.5.1.patch

# Update the Spark patch file with your local modification.
git -C opt/spark add .
git -C opt/spark diff --staged -p > scripts/spark-tests/spark-3.5.1.patch

# Revert the patch.
git -C opt/spark reset
git -C opt/spark apply -R ../../scripts/spark-tests/spark-3.5.1.patch
```
However, note that we should keep the patch minimal. It is possible to alter many Spark test behaviors at runtime via monkey-patching using pytest fixtures.