Data for developing the Clinical Trial Risk Tool

You only need the data in this folder if you plan to train further models.

There are two datasets:

1. Manual dataset

This is a set of between 100 and 300 protocols which have been read through individually and annotated with key parameters such as the sample size. The number of protocols annotated varies by parameter, between 100 and 300.

2. ClinicalTrials.gov dataset

This is a much larger dataset of 11925 protocols downloaded from ClinicalTrials.gov. Each protocol came with metadata: NCT ID, phase, pathology, SAP, number of arms and number of subjects. This metadata was provided voluntarily by the researchers, and in many cases it is out of date or inaccurate.
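As an illustration only, the metadata fields listed above could be held in a small record type like the one below. The class name, field names and types are hypothetical and are not taken from this repository; in particular, treating SAP as a yes/no flag is an assumption. Every field except the NCT ID is optional because the registry values may be missing or unreliable. The one check included relies on the standard NCT ID format, "NCT" followed by eight digits.

```python
import re
from dataclasses import dataclass
from typing import Optional

# ClinicalTrials.gov identifiers are "NCT" followed by eight digits.
NCT_ID_PATTERN = re.compile(r"NCT\d{8}")


@dataclass
class TrialMetadata:
    """Hypothetical container for the metadata shipped with each protocol."""

    nct_id: str
    phase: Optional[str] = None
    pathology: Optional[str] = None
    has_sap: Optional[bool] = None      # assumption: SAP stored as a flag
    num_arms: Optional[int] = None
    num_subjects: Optional[int] = None

    def has_valid_nct_id(self) -> bool:
        # fullmatch rejects trailing characters, including newlines.
        return NCT_ID_PATTERN.fullmatch(self.nct_id) is not None
```

A record with a malformed identifier can then be set aside before it is matched against a downloaded protocol.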

Combining the two datasets gives some of the advantages of a large dataset together with some of the advantages of a smaller, more accurate one.

Downloading the manual dataset

Start Apache Tika (https://tika.apache.org/) so that it is running and available for PDF extraction.
Go into raw_protocols.
Run download_raw_protocols.sh.
Run preprocess.py.
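To show what "running for PDF extraction" involves, here is a minimal sketch of sending one PDF to a Tika server and reading back plain text. It assumes tika-server is listening locally on its default port 9998 and uses Tika's REST endpoint /tika. The function names are hypothetical, and preprocess.py in this repository may call Tika differently.

```python
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf_path, tika_url=TIKA_URL):
    """Send one PDF file to Tika and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the server running, extract_text("some_protocol.pdf") would return the text of that file, where the file name is a placeholder.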

Downloading the ClinicalTrials.gov dataset

Start Apache Tika (https://tika.apache.org/) so that it is running and available for PDF extraction.
Go into ctgov/raw_protocols.
Run download_raw_protocols.sh.
Run 02_parse_all_PDFs_to_json.ipynb.
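Parsing thousands of PDFs takes a while, so it may be worth confirming that Tika is reachable before starting the notebook. The sketch below queries the /version endpoint of tika-server, again assuming the default local port 9998. The function name is hypothetical and not part of this repository.

```python
import urllib.error
import urllib.request


def tika_is_running(version_url="http://localhost:9998/version", timeout=2.0):
    """Return True if a Tika server answers at version_url, else False."""
    try:
        with urllib.request.urlopen(version_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

If this returns False, start Tika first and then rerun the check.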

Working further with the ClinicalTrials.gov dataset using the Postgres Database Dump

Follow the instructions in ctgov/README.md to download/extract the database dump.
