Clinical Trials Data Pipeline

Overview

This pipeline extracts, transforms, and loads (ETL) clinical trial data from ClinicalTrials.gov. It uses dlt for data ingestion, DuckDB for local storage, and dbt for transforming the raw data into a study-centric data mart (clinicaltrials_mart). Additionally, it integrates OpenAI's GPT-3.5 to standardize clinical condition names in the mart, preparing the data for further analysis.

Components

  • DLT: Ingests data from the ClinicalTrials.gov API into DuckDB.
  • DBT: Transforms raw clinical trial data into a refined study data mart (clinicaltrials_mart). This mart aggregates key columns (condition, sponsor_name, investigator_name, adverse_events, and more), using STRING_AGG to roll up data from child tables. The clinicaltrials_mart table is the central hub for standardized clinical trial data (see the sketch after this list).
  • DuckDB: Serves as the local data warehouse, allowing fast querying of both raw and transformed data.
  • LLM (GPT-3.5): The machine learning integration uses OpenAI's GPT-3.5 API to standardize conditions (diseases) within the data mart. The ml_model.py script processes conditions from clinicaltrials_mart and outputs a standardized version, which can be saved back into DuckDB for further use.
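
For illustration, the kind of roll-up the mart performs can be reproduced directly against DuckDB. This is a minimal sketch only: the child-table and column names below (studies, studies__conditions, studies__sponsors) are assumptions based on dlt's default child-table naming, not necessarily the project's real schema.

    import duckdb

    con = duckdb.connect("clinical_trials.duckdb")

    # Roll child rows (one condition/sponsor per row) up to one row per
    # study, mirroring the STRING_AGG pattern the dbt mart relies on.
    # Table and column names are hypothetical.
    rows = con.execute("""
        SELECT
            s.nct_id,
            STRING_AGG(DISTINCT c.condition, '; ')    AS condition,
            STRING_AGG(DISTINCT sp.sponsor_name, '; ') AS sponsor_name
        FROM studies AS s
        LEFT JOIN studies__conditions AS c  ON c._dlt_parent_id = s._dlt_id
        LEFT JOIN studies__sponsors   AS sp ON sp._dlt_parent_id = s._dlt_id
        GROUP BY s.nct_id
        LIMIT 5
    """).fetchall()
    print(rows)
    con.close()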

Data Flow

  1. Ingestion: Data is pulled from ClinicalTrials.gov using dlt and stored in DuckDB (a sketch follows this list).
  2. Transformation: Using dbt, raw data is transformed into the clinicaltrials_mart table, which aggregates important study details such as conditions, sponsors, and adverse events.
  3. Standardization: Conditions from clinicaltrials_mart are standardized using GPT-3.5 to ensure consistency and improve downstream data analysis.
  4. Machine learning: The ml_model.py script performs this standardization and can save the results back into DuckDB, preparing the data for further machine learning or analytical tasks.
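
Illustrating step 1, here is a minimal dlt ingestion sketch. It assumes the v2 studies endpoint and its pageToken/nextPageToken pagination; the resource name, dataset name, and page limits are illustrative and not necessarily what dlt_pipeline.py uses.

    import dlt
    import requests

    BASE_URL = "https://clinicaltrials.gov/api/v2/studies"  # assumed endpoint

    @dlt.resource(name="studies", write_disposition="append")
    def studies(page_size: int = 100, max_pages: int = 50):
        # The real pipeline reportedly persists the page token so reruns
        # resume incrementally; this sketch just walks pages in one run.
        token = None
        for _ in range(max_pages):
            params = {"pageSize": page_size}
            if token:
                params["pageToken"] = token
            payload = requests.get(BASE_URL, params=params, timeout=30).json()
            yield payload.get("studies", [])
            token = payload.get("nextPageToken")
            if not token:  # no further pages
                break

    pipeline = dlt.pipeline(
        pipeline_name="clinical_trials_pipeline",
        destination="duckdb",
        dataset_name="clinical_trials",  # assumed dataset name
    )
    print(pipeline.run(studies()))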

Setup using a Virtual Environment

  1. Create a virtual environment and install dependencies:

    • For Linux/Mac:
      python -m venv clinical_trials_env && source clinical_trials_env/bin/activate && pip install -r requirements.txt
      
    • For Windows:
      python -m venv clinical_trials_env && .\clinical_trials_env\Scripts\activate && pip install -r requirements.txt
      
  2. Configure your OpenAI API key and set the DuckDB path:

    export OPENAI_API_KEY="your_openai_api_key"
    export DBT_DUCKDB_PATH="$(pwd)/clinical_trials.duckdb"
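
Both the Python scripts and dbt's profile are expected to pick these up from the environment; a quick sanity check from Python (variable names taken from the commands above):

    import os

    # Fail fast if the key is missing; fall back to a local file for the
    # DuckDB path, mirroring the default used elsewhere in this README.
    api_key = os.environ["OPENAI_API_KEY"]
    duckdb_path = os.environ.get("DBT_DUCKDB_PATH", "clinical_trials.duckdb")
    print("DuckDB file:", duckdb_path)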
    

Running the Pipeline

  1. Run the data ingestion pipeline, following the fun debugging messages. It is expected to ingest about 5,000 records per run; if you run it multiple times, it picks up from the next page token and ingests incrementally, so the same record is never processed twice:

    python dlt_pipeline.py
    
  2. Run the dbt transformations, which build the clinicaltrials_mart data mart:

    dbt run --project-dir dbt --profiles-dir dbt
    

Viewing the Transformed Data using duckcli

duckcli allows you to easily query the DuckDB file directly from the terminal.

  1. Start duckcli and connect to the DuckDB file:

    duckcli clinical_trials.duckdb
    

    1.1. To list all tables ingested by the pipeline and generated by dbt: .tables

    1.2. To preview the dbt-generated data mart: SELECT * FROM clinicaltrials_mart LIMIT 5;

  2. Alternatively, use dlt's built-in viewer to inspect the pipeline's data:

    dlt pipeline clinical_trials_pipeline show
    

Testing the ml_model.py Script

I must confess I had a hard time sparing OpenAI API calls, so I could not do a thorough test of this part.

  1. Ensure your environment variable OPENAI_API_KEY is set:

    export OPENAI_API_KEY="your_openai_api_key"
    
  2. Run the ml_model.py script to standardize the conditions in clinicaltrials_mart:

    python ml_model.py
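
For reference, a minimal sketch of what the standardization call might look like with the openai client; the prompt wording, row limit, and write-back step are assumptions, and ml_model.py's actual logic may differ.

    import os
    import duckdb
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def standardize(condition: str) -> str:
        # Ask GPT-3.5 for a canonical condition name; the prompt is illustrative.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Return a standardized medical condition name, nothing else."},
                {"role": "user", "content": condition},
            ],
        )
        return response.choices[0].message.content.strip()

    con = duckdb.connect("clinical_trials.duckdb")
    conditions = con.execute(
        "SELECT DISTINCT condition FROM clinicaltrials_mart LIMIT 10"
    ).fetchall()
    for (condition,) in conditions:
        print(condition, "->", standardize(condition))
    con.close()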
    

Dockerized Solution

  1. Install Docker:

    • Linux:
      sudo apt-get install docker-ce docker-compose
      
  2. Build and run the Docker container:

    • Linux:
      sudo systemctl start docker
      docker-compose up --build
      

    The pipeline will run inside the container on a subset of the data. It is not yet wired into an orchestration tool, and components such as the dbt run step and the machine learning model are not yet fully integrated.

Productionization Steps

Next steps for taking this pipeline to production:

  1. Orchestration:

    • Use Airflow, Prefect, or Dagster to manage scheduling and workflows (see the sketch after this list).
    • Assess data refresh rates and available resources to scale infrastructure accordingly.
  2. Scalability:

    • The project may scale to hundreds of data sources and syncs.
    • Transition from DuckDB to a cloud-based warehouse like BigQuery (ideal for Google environments) or Redshift when needed.
  3. Monitoring and Alerts:

    • Implement monitoring tools like Prometheus and Grafana for system health.
    • Add Slack or PagerDuty alerts for real-time notifications.
    • Consider tools like Metaplane for data anomaly detection and observability.
  4. Machine Learning Enhancements:

    • Due to time constraints, further ML model improvements were deferred.
    • Explore integrating advanced algorithms and tighter pipeline integration in future iterations.
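
As a starting point for orchestration, a minimal Prefect sketch that wraps the two existing pipeline steps; the flow and task names are made up, and a real deployment would add scheduling, retries tuned to the API, and alerting.

    import subprocess
    from prefect import flow, task

    @task(retries=2)
    def ingest():
        # Reuse the existing ingestion script as-is.
        subprocess.run(["python", "dlt_pipeline.py"], check=True)

    @task
    def transform():
        # Build the dbt models, including clinicaltrials_mart.
        subprocess.run(
            ["dbt", "run", "--project-dir", "dbt", "--profiles-dir", "dbt"],
            check=True,
        )

    @flow(name="clinical-trials-etl")
    def clinical_trials_etl():
        ingest()
        transform()

    if __name__ == "__main__":
        clinical_trials_etl()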
