This project demonstrates a data quality pipeline that uses Nessie to manage data versioning and validation.
- Dataset: For more information about the dataset used in this project, click here.
- Batch Creation: Batches were created using `./utils/sampler.py`, which samples and prepares the data for processing. `sampled_data_1` contains clean data, while `sampled_data_2` contains corrupted data (a sketch of the idea appears after this list).
- bash-logs: To follow the pipeline's progress, check the log files in this directory.
- unique_keys: generated for qualitative (categorical) data spelling validation and cleaning (see the spelling-correction sketch after this list).
- Bash Script: The bash script is designed to process and analyze ONLY the corrupted data CSV file to identify data quality issues (an illustrative sketch of such checks follows this list).
- Data Corruption Module: You can find the data corruption module, which is used to intentionally corrupt datasets for testing purposes, at this GitHub repository. Note that it's still under development and may not be fully stable.
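The exact sampling logic lives in `./utils/sampler.py`; as a minimal sketch of the idea only, assuming a pandas-based implementation, an inline stand-in `corrupt` helper (the real project delegates corruption to the external module above), and a hypothetical `dataset.csv` source file:

```python
import numpy as np
import pandas as pd

def corrupt(df: pd.DataFrame, frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Stand-in for the external data corruption module:
    randomly blank out a fraction of cells to simulate missing values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < frac
    return df.mask(mask)

def create_batches(source_csv: str, n: int = 1000, seed: int = 42) -> None:
    """Draw two random samples: one left clean, one deliberately corrupted."""
    df = pd.read_csv(source_csv)
    clean = df.sample(n=n, random_state=seed)
    dirty = corrupt(df.sample(n=n, random_state=seed + 1))
    clean.to_csv("sampled_data_1.csv", index=False)   # clean batch
    dirty.to_csv("sampled_data_2.csv", index=False)   # corrupted batch

if __name__ == "__main__":
    create_batches("dataset.csv")  # hypothetical source file name
```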
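How `unique_keys` is consumed isn't spelled out here; one plausible sketch, using only the standard library and assuming the file holds one known-good key per line, is to fuzzy-match each categorical value against the key set:

```python
import difflib

def load_unique_keys(path: str = "unique_keys") -> set[str]:
    """Load the known-good vocabulary, assuming one key per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def fix_spelling(value: str, keys: set[str], cutoff: float = 0.8) -> str:
    """Map a possibly misspelled value to its closest unique key.

    Values already in the key set pass through unchanged; anything
    without a close-enough match is returned as-is for manual review.
    """
    if value in keys:
        return value
    matches = difflib.get_close_matches(value, keys, n=1, cutoff=cutoff)
    return matches[0] if matches else value

# Example with made-up keys: a transposition is snapped back to the key.
keys = {"California", "Texas", "Nevada"}
print(fix_spelling("Claifornia", keys))  # -> California
```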
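The actual analysis is implemented in the bash script; purely as an illustration of the kinds of checks it might run against the corrupted batch (the function name and the column-agnostic checks here are assumptions, not the script's real logic):

```python
import pandas as pd

def quality_report(csv_path: str) -> None:
    """Print simple data quality indicators for a CSV batch:
    missing values, duplicate rows, and per-column cardinality."""
    df = pd.read_csv(csv_path)

    print(f"rows: {len(df)}, columns: {len(df.columns)}")
    print("missing values per column:")
    print(df.isna().sum())
    print(f"duplicate rows: {df.duplicated().sum()}")
    print("distinct values per column:")
    print(df.nunique())

quality_report("sampled_data_2.csv")  # the corrupted batch
```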
To run the project, simply execute `sh start.sh`. Ensure that Docker is installed and properly configured on your system.