The goal of this project is to implement a data validation library for PySpark. The library should detect incorrect data structure, unexpected values in columns, and anomalies in the data.
from checkengine.validate_df import ValidateSparkDataFrame
result = ValidateSparkDataFrame(spark_session, spark_data_frame) \
    .is_not_null("column_name") \
    .are_not_null(["column_name_2", "column_name_3"]) \
    .is_min("numeric_column", 10) \
    .is_max("numeric_column", 20) \
    .is_unique("column_name") \
    .are_unique(["column_name_2", "column_name_3"]) \
    .is_between("numeric_column_2", 10, 15) \
    .has_length_between("text_column", 0, 10) \
    .mean_column_value("numeric_column", 10, 20) \
    .median_column_value("numeric_column", 5, 15) \
    .text_matches_regex("text_column", "^[a-z]{3,10}$") \
    .one_of("text_column", ["value_a", "value_b"]) \
    .one_of("numeric_column", [123, 456]) \
    .execute() #runs the configured checks and returns the result
result.correct_data #rows that passed the validation
result.erroneous_data #rows rejected during the validation
result.errors #a summary of validation errors (three fields: column_name, constraint_name, number_of_errors)
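The three result fields can be illustrated without Spark. The following is a minimal plain-Python sketch of the idea, not the library's implementation: rows are plain dictionaries, and the `validate` function, the `CONSTRAINTS` list, and the constraint names (`not_null`, `between`, `regex_match`) are hypothetical names chosen for this example. It assumes a row is rejected as soon as it fails any constraint, and that the summary counts failures per column and constraint.

```python
import re
from collections import Counter

# Hypothetical row-level constraints: (column_name, constraint_name, predicate).
# The names are illustrative only and are not taken from the library.
CONSTRAINTS = [
    ("numeric_column", "not_null", lambda v: v is not None),
    # Nulls are left to the not_null check, so None passes here.
    ("numeric_column", "between", lambda v: v is None or 10 <= v <= 20),
    ("text_column", "regex_match",
     lambda v: v is not None and re.search(r"^[a-z]{3,10}$", v) is not None),
]


def validate(rows, constraints):
    """Split rows into passing and rejected ones and summarize the failures."""
    correct, erroneous = [], []
    error_counts = Counter()
    for row in rows:
        failed = [(column, name)
                  for column, name, check in constraints
                  if not check(row.get(column))]
        if failed:
            erroneous.append(row)
            error_counts.update(failed)
        else:
            correct.append(row)
    # One summary entry per (column, constraint) pair that failed at least once.
    errors = [
        {"column_name": column, "constraint_name": name, "number_of_errors": count}
        for (column, name), count in sorted(error_counts.items())
    ]
    return correct, erroneous, errors


rows = [
    {"numeric_column": 12, "text_column": "abc"},       # passes every check
    {"numeric_column": 25, "text_column": "abcdef"},    # above the upper bound
    {"numeric_column": 15, "text_column": "AB"},        # does not match the regex
    {"numeric_column": None, "text_column": "x1"},      # null value and regex mismatch
]

correct, erroneous, errors = validate(rows, CONSTRAINTS)
for entry in errors:
    print(entry)
```

In this model a single row can contribute to several summary entries (the last row fails two constraints), so the sum of `number_of_errors` can exceed the number of rejected rows. Whether the library counts errors the same way is not stated here.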
To build the library, install the Poetry build tool.
Then run the following commands:
cd check-engine-lib
poetry build
To run all tests locally:
cd check-engine-lib
poetry run pytest tests/
To run a single test by name:
cd check-engine-lib
poetry run pytest tests/ -k 'test_should_return_df_without_changes_if_all_are_between'
To build and run the check-engine-test Docker image:
docker build -t check-engine-test check-engine-lib/. && docker run check-engine-test