This is a sample project for Databricks, generated via cookiecutter.
To use this project, you need Python 3.X and `pip` or `conda` for package management.
- Instantiate a local Python environment via a tool of your choice. This example is based on `conda`, but you can use any environment management tool:

```bash
conda create -n structure python=3.9
conda activate structure
```
- If you don't have a JDK installed on your local machine, install it (in this example we use a `conda`-based installation):

```bash
conda install -c conda-forge openjdk
```
- Install the project in development mode (this will also install the dev requirements):

```bash
pip install -e ".[dev]"
```
For unit testing, please use `pytest`:

```bash
pytest tests/unit --cov
```

Please check the `tests/unit` directory for more details on how to use unit tests.

In `tests/unit/conftest.py` you'll also find useful testing primitives, such as a local Spark instance with Delta support, a local MLflow instance, and a DBUtils fixture.
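As an illustration, a unit test built on those primitives might look like the sketch below. This is a minimal example, not part of the generated project: the `spark` fixture name and the filtering logic are assumptions, so check `conftest.py` and your own package code for the actual names.

```python
# Hypothetical unit test relying on the local Spark fixture from tests/unit/conftest.py.
# The fixture name `spark` is an assumption; adjust it to whatever conftest.py exposes.
from pyspark.sql import SparkSession


def test_filtering_logic(spark: SparkSession):
    # Build a small in-memory DataFrame instead of reading from external storage.
    source_df = spark.createDataFrame(
        [(1, "keep"), (2, "keep"), (3, "drop")],
        ["id", "status"],
    )

    # In a real test this call would go through a transformation from the `structure` package.
    result_df = source_df.filter(source_df.status == "keep")

    assert result_df.count() == 2
```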
There are two options for running integration tests:

- On an interactive cluster via `dbx execute`
- On a job cluster via `dbx launch`

For quicker startup of the job clusters we recommend using instance pools (AWS, Azure, GCP).
For an integration test on an interactive cluster, use the following command:

```bash
dbx execute --cluster-name=<name of interactive cluster> --job=<name of the job to test>
```
To execute a task inside a multitask job, use the following command:

```bash
dbx execute \
    --cluster-name=<name of interactive cluster> \
    --job=<name of the job to test> \
    --task=<task-key-from-job-definition>
```
For a test on an automated job cluster, deploy the job files and then launch:

```bash
dbx deploy --jobs=<name of the job to test> --files-only
dbx launch --job=<name of the job to test> --as-run-submit --trace
```
Please note that for testing we recommend using jobless deployments (the `--files-only` flag shown above), so you won't affect existing job definitions.
- `dbx` expects that the cluster used for interactive execution supports `%pip` and `%conda` magic commands.
- Please configure your job in the `conf/deployment.yml` file.
- To execute the code interactively, provide either `--cluster-id` or `--cluster-name`:
```bash
dbx execute \
    --cluster-name="<some-cluster-name>" \
    --job=job-name
```
Multiple users can also use the same cluster for development. Libraries will be isolated per execution context.
To start working with your notebooks from Repos, do the following steps:

1. Add your git provider token to your user settings in Databricks.
2. Add your repository to Repos. This could be done via the UI, or via the CLI command below:

   ```bash
   databricks repos create --url <your repo URL> --provider <your-provider>
   ```

   This command will create your personal repository under `/Repos/<username>/structure`.

3. Use `git_source` in your job definition as described in the Databricks documentation.
Please set the following secrets or environment variables for your CI provider:

- `DATABRICKS_HOST`
- `DATABRICKS_TOKEN`
- To trigger the CI pipeline, simply push your code to the repository. If the CI provider is correctly set up, it will trigger the general testing pipeline.
- To trigger the release pipeline, get the current version from the `structure/__init__.py` file (see the sketch after the commands below for reading it programmatically) and tag the current code version:

```bash
git tag -a v<your-project-version> -m "Release tag for version <your-project-version>"
git push origin --tags
```
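If you prefer to read the version programmatically, a minimal sketch is shown below. It assumes the version is exposed as a `__version__` attribute in `structure/__init__.py`, which is an assumption; check the file for how the version is actually defined.

```python
# Hypothetical helper: print the project version for tagging.
# Assumes structure/__init__.py defines a __version__ attribute.
from structure import __version__

print(f"v{__version__}")  # value to use for the release tag
```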