Code to accompany the paper:
"Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports"
Language modeling has become a central tool for modern natural language processing across multiple domains. Here, we evaluated its utility for extracting cancer outcomes data from clinical text reports. This outcomes extraction task is a key rate-limiting step for observational cancer research questions intended to promote precision cancer care using large linked clinical and molecular datasets. Traditional medical record annotation is a slow, manual process, and scaling it up is critical to facilitate accurate and fast clinical decision making.
We have previously demonstrated that simple convolutional neural networks (CNNs), trained on a labeled dataset of imaging reports for over 1,000 patients with non-small cell lung cancer, can yield models able to accurately capture key clinical outcomes from each report, including cancer progression/worsening and response/improvement. In the current analysis, we evaluated whether pre-trained Transformer models, with or without domain adaptation using imaging reports from our institution, can improve performance or reduce the volume of training data needed to yield well-performing models for this document classification task. We performed extensive analyses of multiple variants of pre-trained Transformer models, considering major modeling factors such as 1) training sample size, 2) classification architecture, 3) language-model fine-tuning, 4) classification task, 5) length of text considered, and 6) number of parameters of the Transformer models. We report the performance of these models under each of these considerations.
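As a rough illustration of this setup, the sketch below builds a tiny BERT document classifier with the Hugging Face `transformers` API. It is not the repository's implementation; the checkpoint name (`prajjwal1/bert-tiny`), the example report text, and the frozen-encoder toggle are assumptions for illustration only.

```python
# Illustrative sketch only, not the repository's implementation: a tiny BERT
# document classifier of the kind evaluated in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')  # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    'prajjwal1/bert-tiny', num_labels=2)  # e.g. progression vs. no progression

# "Frozen" variant: keep the encoder fixed and train only the classification head.
FREEZE_ENCODER = True
if FREEZE_ENCODER:
    for param in model.bert.parameters():
        param.requires_grad = False

reports = ['Illustrative report text: interval increase in the right lower lobe mass.']
labels = torch.tensor([1])  # 1 = progression/worsening

# The truncation length corresponds to the "length of text considered" factor above.
batch = tokenizer(reports, truncation=True, max_length=512,
                  padding=True, return_tensors='pt')
loss = model(**batch, labels=labels).loss  # minimized during training
```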
To get a local copy up and running, follow these simple steps:
- Python 3.7; check `environment.yml` for the list of required packages
- Clone the repo
  ```sh
  git clone https://github.com/marakeby/clinicalNLP2.git
  ```
- Create the conda environment. Note that not all packages are needed to generate the paper figures; some are needed only for training.
  ```sh
  conda env create --name cnlp_env --file=environment.yml
  ```
- Based on your use case, you may need to download one or more of the following:

  a. Log files (needed to regenerate the paper figures). Extract the files under the `_cnlp_results` directory. If you would like to store them somewhere else, set the `TEST_RESULTS_PATH` variable in `config_path.py` accordingly.

  b. Plot files (a copy of the paper images). Extract the files under the `_cnlp_plots` directory. If you would like to store them somewhere else, set the `PLOTS_PATH` variable in `config_path.py` accordingly.

  A sketch of what `config_path.py` might look like appears after these steps.
- Activate the created conda environment
  ```sh
  source activate cnlp_env
  ```
- Add the repository directory to your PYTHONPATH, e.g.
  ```sh
  export PYTHONPATH=~/clinicalNLP2:$PYTHONPATH
  ```
- To generate all paper figures, run
  ```sh
  cd ./paper_analysis
  python generate_figures.py
  ```
- To generate an individual paper figure, run the corresponding script under the `paper_analysis_revision2` directory, e.g.
  ```sh
  cd ./paper_analysis_revision2
  python figure_4_samples_sizes.py
  ```
- To re-train a model from scratch, run
  ```sh
  cd ./train
  python run_testing.py
  ```
  This runs the experiment `bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned`, which trains a model to predict progression for cancer patients using a fine-tuned tiny BERT model under different training-set sizes. The results of the experiment are stored under `_logs` in a directory with the same name as the experiment. To run another experiment, uncomment the corresponding line in `run_testing.py` (see the second sketch after these steps).
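As referenced in the download step above, here is a minimal sketch of what `config_path.py` might contain, assuming the default directory layout; the two variable names come from the steps above, but the repository's actual file may differ.

```python
# config_path.py -- minimal illustrative sketch; the actual file may differ.
import os

BASE_PATH = os.path.dirname(os.path.abspath(__file__))

# Where the downloaded log files are extracted (used to regenerate figures).
TEST_RESULTS_PATH = os.path.join(BASE_PATH, '_cnlp_results')

# Where the paper plots are stored.
PLOTS_PATH = os.path.join(BASE_PATH, '_cnlp_plots')
```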
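And here is a rough sketch of the experiment-selection pattern in `run_testing.py` described in the last step; only the first experiment name is taken from this README, and `run_experiment` is a hypothetical stand-in for the script's real entry point.

```python
# Illustrative sketch of run_testing.py's selection pattern -- not the actual
# script; only the first experiment name appears in this README.

def run_experiment(name: str) -> None:
    """Hypothetical stand-in for the repository's real training entry point."""
    print(f'running {name}; results will be written under _logs/{name}')

# Each line names one experiment; leave exactly one uncommented.
experiment = 'bert_classifier/progression_one_split_BERT_sizes_tiny_frozen_tuned'
# experiment = ...  # further experiments are listed in the actual script

if __name__ == '__main__':
    run_experiment(experiment)
```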
Note that the underlying EHR text reports used to train and evaluate NLP models for these analyses constitute protected health information for DFCI patients and therefore cannot be made publicly available. Researchers with DFCI appointments and Institutional Review Board (IRB) approval can access the data on request. For external researchers, access would require collaboration with the authors and eligibility for a DFCI appointment, per DFCI policies.
Distributed under the GPL-2.0 License. See `LICENSE` for more information.
Haitham - @HMarakeby
Project Link: https://github.com/marakeby/clinicalNLP2
- Elmarakeby, H., et al. "Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports."
- Kehl, K. L., Elmarakeby, H., Nishino, M., Van Allen, E. M., Lepisto, E. M., Hassett, M. J., ... & Schrag, D. (2019). Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA Oncology, 5(10), 1421-1429.
- Kehl, K. L., Xu, W., Gusev, A., Bakouny, Z., Choueiri, T. K., Riaz, I. B., ... & Schrag, D. (2021). Artificial intelligence-aided clinical annotation of a large multi-cancer genomic dataset. Nature Communications, 12(1), 1-9.
- Kehl, K. L., Xu, W., Lepisto, E., Elmarakeby, H., Hassett, M. J., Van Allen, E. M., ... & Schrag, D. (2020). Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clinical Cancer Informatics, 4, 680-690.
- National Cancer Institute (NCI)
- Doris Duke Charitable Foundation
- Department of Defense (DoD)
- Mark Foundation Emerging Leader Award
- PCF-Movember Challenge Award