This project demonstrates a data quality pipeline that uses Nessie to manage data versioning and validation.
- Dataset: For more information about the dataset used in this project, click here.
- Batch Creation: Batches were created using `./utils/sampler.py`, which samples and prepares the data for processing. `sampled_data_1` contains clean data, while `sampled_data_2` contains corrupted data (a sketch of the idea appears after this list).
- bash-logs: To follow the pipeline's progress, check the log files in this directory.
- unique_keys: generated for qualitative (categorical) data spelling validation and cleaning (see the spelling-correction sketch after this list).
- Bash Script: The bash script is designed to process and analyze ONLY the corrupted data CSV file to identify data quality issues (an illustrative sketch of such checks follows this list).
- Data Corruption Module: You can find the data corruption module, which is used to intentionally corrupt datasets for testing purposes, at this GitHub repository. Note that it's still under development and may not be fully stable.
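The exact sampling logic lives in `./utils/sampler.py`; as a minimal sketch of the idea only, assuming a pandas-based implementation, an inline stand-in `corrupt` helper (the real project delegates corruption to the external module above), and a hypothetical `dataset.csv` source file:

```python
import numpy as np
import pandas as pd

def corrupt(df: pd.DataFrame, frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Stand-in for the external data corruption module:
    randomly blank out a fraction of cells to simulate missing values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < frac
    return df.mask(mask)

def create_batches(source_csv: str, n: int = 1000, seed: int = 42) -> None:
    """Draw two random samples: one left clean, one deliberately corrupted."""
    df = pd.read_csv(source_csv)
    clean = df.sample(n=n, random_state=seed)
    dirty = corrupt(df.sample(n=n, random_state=seed + 1))
    clean.to_csv("sampled_data_1.csv", index=False)   # clean batch
    dirty.to_csv("sampled_data_2.csv", index=False)   # corrupted batch

if __name__ == "__main__":
    create_batches("dataset.csv")  # hypothetical source file name
```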
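How `unique_keys` is consumed isn't spelled out here; one plausible sketch, using only the standard library and assuming the file holds one known-good key per line, is to fuzzy-match each categorical value against the key set:

```python
import difflib

def load_unique_keys(path: str = "unique_keys") -> set[str]:
    """Load the known-good vocabulary, assuming one key per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def fix_spelling(value: str, keys: set[str], cutoff: float = 0.8) -> str:
    """Map a possibly misspelled value to its closest unique key.

    Values already in the key set pass through unchanged; anything
    without a close-enough match is returned as-is for manual review.
    """
    if value in keys:
        return value
    matches = difflib.get_close_matches(value, keys, n=1, cutoff=cutoff)
    return matches[0] if matches else value

# Example with made-up keys: a transposition is snapped back to the key.
keys = {"California", "Texas", "Nevada"}
print(fix_spelling("Claifornia", keys))  # -> California
```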
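The actual analysis is implemented in the bash script; purely as an illustration of the kinds of checks it might run against the corrupted batch (the function name and the column-agnostic checks here are assumptions, not the script's real logic):

```python
import pandas as pd

def quality_report(csv_path: str) -> None:
    """Print simple data quality indicators for a CSV batch:
    missing values, duplicate rows, and per-column cardinality."""
    df = pd.read_csv(csv_path)

    print(f"rows: {len(df)}, columns: {len(df.columns)}")
    print("missing values per column:")
    print(df.isna().sum())
    print(f"duplicate rows: {df.duplicated().sum()}")
    print("distinct values per column:")
    print(df.nunique())

quality_report("sampled_data_2.csv")  # the corrupted batch
```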
To run the project, simply execute `sh start.sh`. Ensure that Docker is installed and properly configured on your system.