This pipeline extracts, transforms, and loads (ETL) clinical trial data from ClinicalTrials.gov. It uses `dlt` for data ingestion, DuckDB for local storage, and dbt for transforming raw data into a study-centric data mart (`clinicaltrials_mart`). Additionally, it integrates GPT-3.5 for standardizing clinical conditions, leveraging machine learning for further data refinement.
- DLT: Ingests data from the ClinicalTrials.gov API into DuckDB.
- DBT: Transforms raw clinical trial data into a refined study data mart (`clinicaltrials_mart`). This mart aggregates key columns like `condition`, `sponsor_name`, `investigator_name`, `adverse_events`, and more, using `STRING_AGG` to handle data from child tables. The `clinicaltrials_mart` table is the central hub for standardized clinical trial data.
- DuckDB: Serves as the local data warehouse, allowing fast querying of both raw and transformed data.
- LLM (GPT-3.5): The machine learning integration uses OpenAI's GPT-3.5 API to standardize conditions (diseases) within the data mart. The `ml_model.py` script processes conditions from `clinicaltrials_mart` and outputs a standardized version, which can be saved back into DuckDB for further use.
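The `STRING_AGG` roll-up described above can be previewed directly against the DuckDB file from Python. This is a minimal sketch only: the table and column names (`studies`, `studies__conditions`, `_dlt_id`/`_dlt_parent_id`) are assumptions about the dlt-generated schema, and the actual dbt model may differ.

```python
# Illustrative sketch only -- table and column names are assumptions,
# not the actual dbt model shipped with this repo.
import duckdb

con = duckdb.connect("clinical_trials.duckdb")

# Roll child rows (one condition per row) up to one row per study,
# mirroring the STRING_AGG pattern used in the data mart.
rows = con.execute("""
    SELECT
        s.nct_id,
        STRING_AGG(c.condition, '; ') AS condition
    FROM studies AS s
    LEFT JOIN studies__conditions AS c
      ON c._dlt_parent_id = s._dlt_id
    GROUP BY s.nct_id
    LIMIT 5
""").fetchall()

for row in rows:
    print(row)
```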
- Ingestion: Data is pulled from ClinicalTrials.gov using `dlt` and stored in DuckDB.
- Transformation: Using `dbt`, raw data is transformed into the `clinicaltrials_mart` table, which aggregates important study details such as conditions, sponsors, and adverse events.
- Standardization: Conditions from `clinicaltrials_mart` are standardized using GPT-3.5 to ensure consistency and improve downstream data analysis.
- Machine Learning: A basic ML model is integrated to standardize the conditions in the clinical trials, preparing the data for further machine learning or analytical tasks.
- Create a virtual environment and install dependencies:
  - For Linux/Mac:
    ```bash
    python -m venv clinical_trials_env && source clinical_trials_env/bin/activate && pip install -r requirements.txt
    ```
  - For Windows:
    ```bash
    python -m venv clinical_trials_env && .\clinical_trials_env\Scripts\activate && pip install -r requirements.txt
    ```
- Configure your OpenAI API key and set the DuckDB path:
  ```bash
  export OPENAI_API_KEY="your_openai_api_key"
  export DBT_DUCKDB_PATH="$(pwd)/clinical_trials.duckdb"
  ```
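For reference, this is roughly how these variables can be picked up on the Python side. It's a sketch only; the actual lookups in `dlt_pipeline.py`, `ml_model.py`, and the dbt profile may differ.

```python
import os

# Fail fast if the OpenAI key is missing rather than erroring mid-run.
openai_api_key = os.environ["OPENAI_API_KEY"]

# Fall back to a DuckDB file in the current directory if the path is not set.
duckdb_path = os.environ.get("DBT_DUCKDB_PATH", "clinical_trials.duckdb")
```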
- Run the data ingestion pipeline, following the fun debugging messages along the way. It is expected to ingest 5k records per run, and if you run it multiple times it will ingest incrementally based on the next page token from the previous run, so the same record is never processed twice:
  ```bash
  python dlt_pipeline.py
  ```
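For context, the page-token-based incremental ingestion can be sketched roughly as follows. This is not the shipped `dlt_pipeline.py`: the dataset name, the API endpoint, and the `pageToken`/`nextPageToken` field names are assumptions for illustration.

```python
# Rough sketch of page-token-based incremental ingestion with dlt.
# Endpoint, field names, and dataset name are assumptions, not the shipped script.
import dlt
import requests

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

@dlt.resource(name="studies", write_disposition="append")
def studies(page_size: int = 100, max_records: int = 5000):
    # Resume from the page token saved on the previous run so that
    # already-ingested records are not processed twice.
    state = dlt.current.resource_state()
    params = {"pageSize": page_size}
    if state.get("next_page_token"):
        params["pageToken"] = state["next_page_token"]

    fetched = 0
    while fetched < max_records:
        response = requests.get(BASE_URL, params=params)
        response.raise_for_status()
        payload = response.json()
        batch = payload.get("studies", [])
        if not batch:
            break
        yield batch
        fetched += len(batch)

        # Remember where to continue on the next run.
        token = payload.get("nextPageToken")
        state["next_page_token"] = token
        if not token:
            break
        params["pageToken"] = token

pipeline = dlt.pipeline(
    pipeline_name="clinical_trials_pipeline",
    destination="duckdb",
    dataset_name="clinical_trials_data",
)
print(pipeline.run(studies()))
```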
- Run the dbt transformations. This is mainly to build the `clinicaltrials_mart` data mart:
  ```bash
  dbt run --project-dir dbt --profiles-dir dbt
  ```
`duckcli` allows you to easily query the DuckDB file directly from the terminal.
- Start `duckcli` and connect to the DuckDB file:
  ```bash
  duckcli clinical_trials.duckdb
  ```
  1.1. To list all tables ingested by the pipeline and generated by dbt:
  ```
  .tables
  ```
  1.2. To preview our dbt-generated data mart:
  ```sql
  SELECT * FROM clinicaltrials_mart LIMIT 5;
  ```
- Using `dlt` to show data:
  ```bash
  dlt pipeline clinical_trials_pipeline show
  ```
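If you prefer to query the DuckDB file programmatically rather than through duckcli, the same checks can be run with the `duckdb` Python package (a minimal sketch, assuming the mart lives in the default schema):

```python
import duckdb

# Open the warehouse file in read-only mode.
con = duckdb.connect("clinical_trials.duckdb", read_only=True)

# Same checks as above: list the tables, then peek at the data mart.
print(con.execute("SHOW TABLES").fetchall())
print(con.execute("SELECT * FROM clinicaltrials_mart LIMIT 5").fetchall())
```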
I must confess I had a hard time finding spare OpenAI API credits, hence couldn't do a thorough test of this part.
- Ensure your environment variable `OPENAI_API_KEY` is set:
  ```bash
  export OPENAI_API_KEY="your_openai_api_key"
  ```
- Run the `ml_model.py` script to standardize the conditions in `clinicaltrials_mart`:
  ```bash
  python ml_model.py
  ```
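For reference, the standardization step boils down to sending each condition string from the mart to GPT-3.5 and writing the result back to DuckDB. The sketch below shows that pattern under some assumptions: the `standardized_conditions` side table, the prompt, and the column names are illustrative, not necessarily what `ml_model.py` does.

```python
# Sketch of GPT-3.5 condition standardization; prompt, table, and column
# names are assumptions, not necessarily what ml_model.py implements.
import os

import duckdb
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
con = duckdb.connect("clinical_trials.duckdb")

# Pull the distinct raw condition strings from the data mart.
conditions = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT condition FROM clinicaltrials_mart WHERE condition IS NOT NULL"
    ).fetchall()
]

def standardize(condition: str) -> str:
    """Ask GPT-3.5 for a single canonical name for a condition."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Return a single standardized disease name."},
            {"role": "user", "content": condition},
        ],
    )
    return response.choices[0].message.content.strip()

# Save the mapping in a side table so the mart itself stays untouched.
con.execute("CREATE TABLE IF NOT EXISTS standardized_conditions (raw TEXT, standardized TEXT)")
for raw in conditions:
    con.execute("INSERT INTO standardized_conditions VALUES (?, ?)", [raw, standardize(raw)])
```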
- Install Docker:
  - Linux:
    ```bash
    sudo apt-get install docker-ce docker-compose
    ```
- Build and run the Docker container:
  - Linux:
    ```bash
    sudo systemctl start docker
    docker-compose up --build
    ```

  The pipeline will run inside the container for a subset of data. It currently lacks full integration with orchestration tools, and certain pipeline components, like `dbt run` and the machine learning model, are not yet fully integrated.
Next steps for taking this pipeline to production:
- Orchestration:
  - Use Airflow, Prefect, or Dagster to manage scheduling and workflows (a minimal sketch follows at the end of this section).
  - Assess data refresh rates and available resources to scale infrastructure accordingly.
- Scalability:
  - The project may scale to hundreds of data sources and syncs.
  - Transition from DuckDB to a cloud-based warehouse like BigQuery (ideal for Google environments) or Redshift when needed.
- Monitoring and Alerts:
  - Implement monitoring tools like Prometheus and Grafana for system health.
  - Add Slack or PagerDuty alerts for real-time notifications.
  - Consider tools like Metaplane for data anomaly detection and observability.
- Machine Learning Enhancements:
  - Due to time constraints, further ML model improvements were deferred.
  - Explore integrating advanced algorithms and tighter pipeline integration in future iterations.
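As one possible shape for the orchestration step mentioned above, a minimal Prefect flow could wrap the existing commands. This is a sketch under the assumption that Prefect is chosen; Airflow or Dagster would work just as well, and nothing here ships with the repo.

```python
# Sketch of orchestrating the existing steps with Prefect (assumption: Prefect 2.x).
import subprocess

from prefect import flow, task

@task(retries=2)
def ingest():
    # Reuse the existing ingestion script as-is.
    subprocess.run(["python", "dlt_pipeline.py"], check=True)

@task
def transform():
    subprocess.run(["dbt", "run", "--project-dir", "dbt", "--profiles-dir", "dbt"], check=True)

@task
def standardize():
    subprocess.run(["python", "ml_model.py"], check=True)

@flow(name="clinical-trials-etl")
def clinical_trials_etl():
    ingest()
    transform()
    standardize()

if __name__ == "__main__":
    # Run once locally; scheduling would be handled by a Prefect deployment.
    clinical_trials_etl()
```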