
Data-Quality-with-Nessie

This project demonstrates a data quality pipeline using Nessie for managing data versioning and validation.

  • Dataset: More information about the dataset used in this project is linked here.
  • Batch Creation: Batches were created using ./utils/sampler.py, which samples and prepares the data for processing.
    • sampled_data_1 contains clean data, while sampled_data_2 contains corrupted data.
  • bash-logs: To follow the pipeline's progress, check the log files in this directory.
  • unique_keys: A file generated for quantitative data spelling validation and cleaning.
  • Bash Script: The bash script is designed to process and analyze ONLY the corrupted data CSV file to identify data quality issues.
  • Data Corruption Module: The module used to intentionally corrupt datasets for testing purposes is available at this GitHub repository. Note that it is still under development and may not be fully stable.
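The spelling validation and cleaning step described above can be sketched in Python. This is a minimal illustration, not the project's actual implementation: it assumes unique_keys behaves like a set of known-good values, and the clean_value helper, the example key set, and the sample records are all hypothetical names invented for this sketch. Fuzzy matching here uses the standard library's difflib.

```python
import difflib

# Hypothetical stand-in for the values held in unique_keys.
VALID_KEYS = {"electronics", "furniture", "clothing"}

def clean_value(value, valid_keys, cutoff=0.8):
    """Return value if it is a known key, the closest valid spelling
    if one is similar enough, or None when no plausible match exists."""
    if value in valid_keys:
        return value
    # get_close_matches ranks candidates by similarity ratio (0..1);
    # anything below the cutoff is treated as unmatchable.
    matches = difflib.get_close_matches(value, valid_keys, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Sample batch: one clean value, two misspellings, one unknown value.
records = ["electronics", "furnture", "clothng", "toys"]
cleaned = [clean_value(r, VALID_KEYS) for r in records]
```

In this sketch, "furnture" and "clothng" are corrected to their nearest valid keys, while "toys" maps to None and can be flagged as a data quality issue rather than silently kept.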

How to Run

To run the project, execute sh start.sh. Ensure that Docker is installed and properly configured on your system.
