Documentation: https://data-management-python.readthedocs.io
This repository contains the core Python library developed and maintained by the NIHR Imperial BRC Genomics Facility for managing raw and processed genomic datasets efficiently.
1. Metadata Management
- Utilizes an extended ENA metadata model for managing information about:
  - Projects
  - Samples
  - Sequencing runs
  - Analyses
  - File paths
  - Pipeline instances
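As a rough illustration of how such a metadata hierarchy hangs together, here is a minimal sketch using plain dataclasses. It is not the library's actual schema: the class names, field names, and the `files_for_project` helper are all hypothetical, and only three of the listed entity types are modelled.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical, simplified entities; the real model is richer.
@dataclass
class Run:
    run_id: str
    file_paths: List[str] = field(default_factory=list)


@dataclass
class Sample:
    sample_id: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Project:
    project_id: str
    samples: List[Sample] = field(default_factory=list)


def files_for_project(project: Project) -> List[str]:
    """Collect every file path registered under a project, in insertion order."""
    return [
        path
        for sample in project.samples
        for run in sample.runs
        for path in run.file_paths
    ]
```

The point of the sketch is the parent-child linkage (project, then sample, then run, then files), which lets a single lookup on a project reach all of its data files.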
2. Genomic Sequencing Runs Processing
- Tracks ongoing sequencing runs and initiates processing upon completion.
- Generates summary reports and sends email notifications to users.
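One common way to detect that a sequencing run has finished is to look for a completion marker file in the run directory. The sketch below assumes that approach; it is not necessarily how this library tracks runs. The function name is hypothetical, and the default marker `RTAComplete.txt` is an assumption based on the file Illumina instruments typically write at the end of a run.

```python
from pathlib import Path
from typing import Iterable, List


def find_completed_runs(
    root: str,
    marker: str = "RTAComplete.txt",  # assumed marker file name
    already_processed: Iterable[str] = (),
) -> List[str]:
    """Return names of run directories under `root` that contain the
    marker file and have not been processed yet, sorted by name."""
    done = set(already_processed)
    completed = []
    for run_dir in sorted(Path(root).iterdir()):
        if not run_dir.is_dir() or run_dir.name in done:
            continue
        if (run_dir / marker).is_file():
            completed.append(run_dir.name)
    return completed
```

A scheduler could call this periodically and hand each returned run name to the processing step, adding it to `already_processed` afterwards.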
3. Analysis Pipelines
- Includes wrappers for both community-developed and vendor-provided data pipelines.
- Automates:
  - Configuration generation
  - Input formatting
- Executes external pipelines on HPC using bash script wrappers.
- Manages post-processing, including:
  - Custom report generation
  - Analysis data validation
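The wrapper pattern described above (generate a configuration, then emit a bash script that runs the external pipeline) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names, the JSON config format, and the `--config` flag are hypothetical.

```python
import json
import shlex
from pathlib import Path
from typing import Dict, List


def write_config(params: Dict[str, object], config_path: str) -> str:
    """Write pipeline parameters to a JSON file and return its path."""
    Path(config_path).write_text(json.dumps(params, indent=2, sort_keys=True))
    return config_path


def render_wrapper(command: List[str], config_path: str) -> str:
    """Build the text of a bash script that runs `command` with the config.

    The `--config` flag is a placeholder; real pipelines use their own options.
    """
    full_command = command + ["--config", config_path]
    quoted = " ".join(shlex.quote(part) for part in full_command)
    # `set -euo pipefail` makes the script stop on the first failing step.
    return "\n".join(["#!/bin/bash", "set -euo pipefail", quoted, ""])
```

Quoting each argument with `shlex.quote` keeps paths containing spaces or shell metacharacters from breaking the generated script. The script text would then be written to disk and submitted through whatever job scheduler the HPC system uses.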
Requirements: Python v3.9
1. Clone the Repository
git clone https://github.com/imperial-genomics-facility/data-management-python.git
2. Install Dependencies. Install the required Python libraries:
pip install -r requirements_2.6.2.txt # For compatibility with Apache Airflow v2.6.2
3. Update PYTHONPATH. Add the core library path to PYTHONPATH, replacing /PATH with the directory that contains your clone:
export PYTHONPATH=/PATH/data-management-python
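To confirm that the export took effect, a small check like the following can be run in a fresh Python session. It is a generic sketch, not part of the library; the helper name `is_on_python_path` is hypothetical.

```python
import sys
from pathlib import Path
from typing import Iterable, Optional


def is_on_python_path(directory: str, paths: Optional[Iterable[str]] = None) -> bool:
    """Return True if `directory` is one of the import search paths.

    `paths` defaults to sys.path, which includes entries from PYTHONPATH.
    Paths are resolved first so that relative or symlinked forms still match.
    """
    target = Path(directory).resolve()
    candidates = sys.path if paths is None else paths
    return any(Path(entry).resolve() == target for entry in candidates)
```

For example, `is_on_python_path("/PATH/data-management-python")`, with `/PATH` replaced as in the step above, should return `True` once the variable is set in the shell that launched Python.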
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.