This pipeline extracts, transforms, and loads (ETL) clinical trial data from ClinicalTrials.gov. It uses `dlt` for data ingestion, DuckDB for local storage, and dbt for transforming raw data into a study-centric data mart (`clinicaltrials_mart`). Additionally, it integrates GPT-3.5 for standardizing clinical conditions, leveraging machine learning for further data refinement.
- DLT: Ingests data from the ClinicalTrials.gov API into DuckDB.
- DBT: Transforms raw clinical trial data into a refined study data mart (`clinicaltrials_mart`). This mart aggregates key columns like `condition`, `sponsor_name`, `investigator_name`, `adverse_events`, and more, using `STRING_AGG` to handle data from child tables. The `clinicaltrials_mart` table is the central hub for standardized clinical trial data.
- DuckDB: Serves as the local data warehouse, allowing fast querying of both raw and transformed data.
- LLM (GPT-3.5): The machine learning integration uses OpenAI's GPT-3.5 API to standardize conditions (diseases) within the data mart. The `ml_model.py` script processes conditions from `clinicaltrials_mart` and outputs a standardized version, which can be saved back into DuckDB for further use.
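The `STRING_AGG` roll-up described above can be previewed directly against the DuckDB file from Python. This is a minimal sketch only: the table and column names (`studies`, `studies__conditions`, `_dlt_id`/`_dlt_parent_id`) are assumptions about the dlt-generated schema, and the actual dbt model may differ.

```python
# Illustrative sketch only -- table and column names are assumptions,
# not the actual dbt model shipped with this repo.
import duckdb

con = duckdb.connect("clinical_trials.duckdb")

# Roll child rows (one condition per row) up to one row per study,
# mirroring the STRING_AGG pattern used in the data mart.
rows = con.execute("""
    SELECT
        s.nct_id,
        STRING_AGG(c.condition, '; ') AS condition
    FROM studies AS s
    LEFT JOIN studies__conditions AS c
      ON c._dlt_parent_id = s._dlt_id
    GROUP BY s.nct_id
    LIMIT 5
""").fetchall()

for row in rows:
    print(row)
```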
- Ingestion: Data is pulled from ClinicalTrials.gov using `dlt` and stored in DuckDB.
- Transformation: Using `dbt`, raw data is transformed into the `clinicaltrials_mart` table, which aggregates important study details such as conditions, sponsors, and adverse events.
- Standardization: Conditions from `clinicaltrials_mart` are standardized using GPT-3.5 to ensure consistency and improve downstream data analysis.
- Machine Learning: A basic ML model is integrated to standardize the conditions in the clinical trials, preparing the data for further machine learning or analytical tasks.
- Create a virtual environment and install dependencies:
  - For Linux/Mac:
    ```bash
    python -m venv clinical_trials_env && source clinical_trials_env/bin/activate && pip install -r requirements.txt
    ```
  - For Windows:
    ```bash
    python -m venv clinical_trials_env && .\clinical_trials_env\Scripts\activate && pip install -r requirements.txt
    ```
- Configure your OpenAI API key and set the DuckDB path:
  ```bash
  export OPENAI_API_KEY="your_openai_api_key"
  export DBT_DUCKDB_PATH="$(pwd)/clinical_trials.duckdb"
  ```
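For reference, this is roughly how these variables can be picked up on the Python side. It's a sketch only; the actual lookups in `dlt_pipeline.py`, `ml_model.py`, and the dbt profile may differ.

```python
import os

# Fail fast if the OpenAI key is missing rather than erroring mid-run.
openai_api_key = os.environ["OPENAI_API_KEY"]

# Fall back to a DuckDB file in the current directory if the path is not set.
duckdb_path = os.environ.get("DBT_DUCKDB_PATH", "clinical_trials.duckdb")
```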
- Run the data ingestion pipeline, following the fun debugging messages along the way. It is expected to ingest 5k records per run, and if you run it multiple times it will ingest incrementally based on the next page token from the previous run, so the same record is never processed twice:
  ```bash
  python dlt_pipeline.py
  ```
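For context, the page-token-based incremental ingestion can be sketched roughly as follows. This is not the shipped `dlt_pipeline.py`: the dataset name, the API endpoint, and the `pageToken`/`nextPageToken` field names are assumptions for illustration.

```python
# Rough sketch of page-token-based incremental ingestion with dlt.
# Endpoint, field names, and dataset name are assumptions, not the shipped script.
import dlt
import requests

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

@dlt.resource(name="studies", write_disposition="append")
def studies(page_size: int = 100, max_records: int = 5000):
    # Resume from the page token saved on the previous run so that
    # already-ingested records are not processed twice.
    state = dlt.current.resource_state()
    params = {"pageSize": page_size}
    if state.get("next_page_token"):
        params["pageToken"] = state["next_page_token"]

    fetched = 0
    while fetched < max_records:
        response = requests.get(BASE_URL, params=params)
        response.raise_for_status()
        payload = response.json()
        batch = payload.get("studies", [])
        if not batch:
            break
        yield batch
        fetched += len(batch)

        # Remember where to continue on the next run.
        token = payload.get("nextPageToken")
        state["next_page_token"] = token
        if not token:
            break
        params["pageToken"] = token

pipeline = dlt.pipeline(
    pipeline_name="clinical_trials_pipeline",
    destination="duckdb",
    dataset_name="clinical_trials_data",
)
print(pipeline.run(studies()))
```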
- Run the dbt transformations. This is mainly to build the `clinicaltrials_mart` data mart:
  ```bash
  dbt run --project-dir dbt --profiles-dir dbt
  ```
`duckcli` allows you to easily query the DuckDB file directly from the terminal.
- Start `duckcli` and connect to the DuckDB file:
  ```bash
  duckcli clinical_trials.duckdb
  ```
  1.1. To list all tables ingested by the pipeline and generated by dbt:
  ```
  .tables
  ```
  1.2. To preview our dbt-generated data mart:
  ```sql
  SELECT * FROM clinicaltrials_mart LIMIT 5;
  ```
- Using `dlt` to show data:
  ```bash
  dlt pipeline clinical_trials_pipeline show
  ```
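If you prefer to query the DuckDB file programmatically rather than through duckcli, the same checks can be run with the `duckdb` Python package (a minimal sketch, assuming the mart lives in the default schema):

```python
import duckdb

# Open the warehouse file in read-only mode.
con = duckdb.connect("clinical_trials.duckdb", read_only=True)

# Same checks as above: list the tables, then peek at the data mart.
print(con.execute("SHOW TABLES").fetchall())
print(con.execute("SELECT * FROM clinicaltrials_mart LIMIT 5").fetchall())
```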
I must confess I had a hard time finding spare OpenAI API credits, hence couldn't do a thorough test of this part.
- Ensure your environment variable `OPENAI_API_KEY` is set:
  ```bash
  export OPENAI_API_KEY="your_openai_api_key"
  ```
- Run the `ml_model.py` script to standardize the conditions in `clinicaltrials_mart`:
  ```bash
  python ml_model.py
  ```
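For reference, the standardization step boils down to sending each condition string from the mart to GPT-3.5 and writing the result back to DuckDB. The sketch below shows that pattern under some assumptions: the `standardized_conditions` side table, the prompt, and the column names are illustrative, not necessarily what `ml_model.py` does.

```python
# Sketch of GPT-3.5 condition standardization; prompt, table, and column
# names are assumptions, not necessarily what ml_model.py implements.
import os

import duckdb
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
con = duckdb.connect("clinical_trials.duckdb")

# Pull the distinct raw condition strings from the data mart.
conditions = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT condition FROM clinicaltrials_mart WHERE condition IS NOT NULL"
    ).fetchall()
]

def standardize(condition: str) -> str:
    """Ask GPT-3.5 for a single canonical name for a condition."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Return a single standardized disease name."},
            {"role": "user", "content": condition},
        ],
    )
    return response.choices[0].message.content.strip()

# Save the mapping in a side table so the mart itself stays untouched.
con.execute("CREATE TABLE IF NOT EXISTS standardized_conditions (raw TEXT, standardized TEXT)")
for raw in conditions:
    con.execute("INSERT INTO standardized_conditions VALUES (?, ?)", [raw, standardize(raw)])
```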
- Install Docker:
  - Linux:
    ```bash
    sudo apt-get install docker-ce docker-compose
    ```
- Build and run the Docker container:
  - Linux:
    ```bash
    sudo systemctl start docker
    docker-compose up --build
    ```

  The pipeline will run inside the container for a subset of data. It currently lacks full integration with orchestration tools, and certain pipeline components, like `dbt run` and the machine learning model, are not yet fully integrated.
Next steps for taking this pipeline to production:
- Orchestration:
  - Use Airflow, Prefect, or Dagster to manage scheduling and workflows (a minimal sketch follows at the end of this section).
  - Assess data refresh rates and available resources to scale infrastructure accordingly.
- Scalability:
  - The project may scale to hundreds of data sources and syncs.
  - Transition from DuckDB to a cloud-based warehouse like BigQuery (ideal for Google environments) or Redshift when needed.
- Monitoring and Alerts:
  - Implement monitoring tools like Prometheus and Grafana for system health.
  - Add Slack or PagerDuty alerts for real-time notifications.
  - Consider tools like Metaplane for data anomaly detection and observability.
- Machine Learning Enhancements:
  - Due to time constraints, further ML model improvements were deferred.
  - Explore integrating advanced algorithms and tighter pipeline integration in future iterations.
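As one possible shape for the orchestration step mentioned above, a minimal Prefect flow could wrap the existing commands. This is a sketch under the assumption that Prefect is chosen; Airflow or Dagster would work just as well, and nothing here ships with the repo.

```python
# Sketch of orchestrating the existing steps with Prefect (assumption: Prefect 2.x).
import subprocess

from prefect import flow, task

@task(retries=2)
def ingest():
    # Reuse the existing ingestion script as-is.
    subprocess.run(["python", "dlt_pipeline.py"], check=True)

@task
def transform():
    subprocess.run(["dbt", "run", "--project-dir", "dbt", "--profiles-dir", "dbt"], check=True)

@task
def standardize():
    subprocess.run(["python", "ml_model.py"], check=True)

@flow(name="clinical-trials-etl")
def clinical_trials_etl():
    ingest()
    transform()
    standardize()

if __name__ == "__main__":
    # Run once locally; scheduling would be handled by a Prefect deployment.
    clinical_trials_etl()
```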