This repository contains the code necessary to recapitulate the results of the AMIA summit paper, Chemical Entity Recognition for MEDLINE Indexing.
Included here are:
- The manually annotated ChEMFAM corpus in BRAT format
- Text files of all of the annotated articles
- Raw annotations (in assorted formats) of all tools run in the paper
- Annotation sets generated by tools
- Code for fine-tuning and running BERT and XLNet models
- Instructions for running PubTator Central and ChemDataExtractor
- Scripts for performing the evaluation.
The ChEMFAM corpus is located in the data/ChEMFAM_corpus directory. Both the .ann files (BRAT standoff format) and the .txt files of the articles are included. In each .txt file, the first line contains the title of the article and the second line contains the abstract.
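As a quick illustration of the layout, the following Python sketch (not part of the repository) reads one article and its annotations, assuming standard BRAT standoff lines of the form `T1<TAB>TYPE START END<TAB>mention`; the PMID used for the file names is made up, and discontinuous spans are not handled:

```python
from pathlib import Path

corpus_dir = Path("data/ChEMFAM_corpus")
doc_id = "12345678"  # hypothetical PMID; use any .txt/.ann pair in the directory

# First line of the .txt file is the title, second line is the abstract
title, abstract = (corpus_dir / f"{doc_id}.txt").read_text().splitlines()[:2]

# BRAT standoff lines look like: "T1<TAB>TYPE START END<TAB>mention text"
entities = []
for line in (corpus_dir / f"{doc_id}.ann").read_text().splitlines():
    if line.startswith("T"):  # text-bound annotations; simple contiguous spans only
        _, span, mention = line.split("\t")
        etype, start, end = span.split()
        entities.append((etype, int(start), int(end), mention))

print(title)
print(entities[:5])
```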
The guidelines for annotations are available as a .docx file, ChEMFAM_Annotation_Guidelines.docx.
To clone this repository with the BERT and XLNet submodules (necessary if you want to train the models), run:
git clone --recurse-submodules https://github.com/saverymax/CER-for-MTI.git
BERT and XLNet models can be downloaded with
bash download_models.sh
You will need about 5GB of disk space.
All experiments were run in Ubuntu 16.04. A GPU with 10GB of memory is required to train the models.
Before training and running the evaluation, it is recommended to create a virtual Python environment with Python 3.6.8.
For example
conda create --name chemfam_env python=3.6.8
The dependencies can be installed with
pip install -r requirements.txt
This will install the following packages:
- tf_metrics
- sentencepiece
- leven
- tensorflow-gpu (version 1.12.2)
- numpy (version 1.16.1)
Note: if the machine you are installing on does not have a GPU, installing tf_metrics will interfere with the installation of tensorflow-gpu 1.12.2, because tf_metrics will pull in the most recent CPU version of TensorFlow, which has not been tested with the BERT and XLNet Python modules.
To recreate the results from the paper, download the models, install the dependencies, and run the following commands.
bash train_models.sh
bash run_models.sh
python run_tool_evaluation.py
The commands are explained in further detail below.
If you don't want to do any training and just want to see the results from the annotations provided in the data/tool_annotations directory, run
python run_tool_evaluation.py
The output of run_tool_evaluation.py is explained in the Evaluation section.
To generate the results yourself, the models must be trained on entity mentions and run on the ChEMFAM corpus.
To train all models:
bash train_models.sh
To train specific types of models, see below.
There are two training datasets included here, converted into BIO format:
- The BioCreative II dataset containing gene and gene product mentions.
- The BioCreative IV CHEMDNER dataset containing chemical entity mentions. The training and development datasets were merged for training the models in this project.
These can be found at https://biocreative.bioinformatics.udel.edu/resources/
If you have access to the datasets, they can be converted into BIO format with the convert_GM2BIO.py and convert_chemdner2BIO.py scripts, located in the data/training_data directory. This requires installing ChemListem, which also installs the excellent chemical tokenizer module chemtok.
pip install chemlistem
Refer to the scripts for more information.
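Purely as a rough sketch of what BIO conversion involves (not the actual conversion scripts, which use chemtok for tokenization), here is a whitespace-tokenized example with a made-up chemical span and label:

```python
def to_bio(text, spans, label="CHEM"):
    """Convert character-offset entity spans to per-token BIO tags.
    Whitespace tokenization is used here purely for illustration;
    the real conversion scripts tokenize with chemtok."""
    tokens, tags, offset = [], [], 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        tag = "O"
        for s, e in spans:
            if start >= s and end <= e:
                tag = "B-" + label if start == s else "I-" + label
        tokens.append(token)
        tags.append(tag)
    return list(zip(tokens, tags))

# Hypothetical example: "acetylsalicylic acid" covers characters 0-20
print(to_bio("acetylsalicylic acid inhibits COX", [(0, 20)]))
# [('acetylsalicylic', 'B-CHEM'), ('acid', 'I-CHEM'), ('inhibits', 'O'), ('COX', 'O')]
```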
To train just the BERT-based models, run the train_bert.sh script. This will generate BERT, SciBERT, and BioBERT models trained on the BC4CHEMD and BC2GM data (one model per pretrained model per dataset, six models in total).
The code for NER in the repository https://github.com/kyzhouhzau/BERT-NER/tree/master/old_version was used as reference to write the BERT_annotator.py script.
To train the XLNet models, run the train_xlnet.sh script.
The code for NER in the repository https://github.com/stevezheng23/xlnet_extension_tf was used as reference to write the XLNet_annotator.py script.
To run all CER systems on the ChEMFAM corpus, run the run_models.sh script. Instructions for individual models are below.
ChemDataExtractor can be installed and imported into Python.
pip install ChemDataExtractor
The run_ChemDataExtractor.py script will run the system on the text, generating annotations for each article.
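Conceptually, the extraction step looks roughly like the sketch below (a minimal illustration assuming the chemdataextractor Document API, not the actual run_ChemDataExtractor.py script; the file name is hypothetical):

```python
from chemdataextractor import Document

# Title on the first line, abstract on the second, as in the ChEMFAM .txt files
with open("data/ChEMFAM_corpus/12345678.txt") as f:  # hypothetical PMID
    title, abstract = f.read().splitlines()[:2]

doc = Document(title + " " + abstract)
for span in doc.cems:  # chemical entity mentions with character offsets
    print(span.start, span.end, span.text)
```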
PubTator Central can be accessed at https://www.ncbi.nlm.nih.gov/research/pubtator/index.html. Upload the pmids_to_annotate.txt file to the collection manager, and download the results in PubTator format, placing them in the data/tool_annotations directory.
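The downloaded PubTator-format files interleave title and abstract lines with tab-separated annotation lines. A minimal sketch for pulling out the chemical mentions (the file name is hypothetical, and this is illustrative rather than the parsing code used by the evaluation scripts):

```python
def read_pubtator(path, entity_type="Chemical"):
    """Yield (pmid, start, end, mention) tuples for one entity type from a
    PubTator-format file: "pmid|t|title", "pmid|a|abstract", then
    tab-separated annotation lines, with a blank line between documents."""
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or "|t|" in line or "|a|" in line:
                continue  # skip blank, title, and abstract lines
            fields = line.split("\t")
            if len(fields) >= 5 and fields[4] == entity_type:
                yield fields[0], int(fields[1]), int(fields[2]), fields[3]

# Hypothetical file name placed in the data/tool_annotations directory
for ann in read_pubtator("data/tool_annotations/pubtator_results.txt"):
    print(ann)
```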
All BERT models, including SciBERT and BioBERT, can be run with the run_bert.sh script. This will generate predictions for chemicals in the ChEMFAM corpus.
XLNet models can be run with the run_xlnet.sh script.
At this time there is no simple way to recapitulate the results of MetaMapLite or MTI. While these tools have open source implementations, the results for this paper were generated using in-house modifications.
Similarly, no code is provided here to run the ChemListem or LSTM-CRF models. However, reference implementations can be found at https://bitbucket.org/rscapplications/chemlistem/src/master/ and https://github.com/guillaumegenthial/tf_ner. Additionally, there are many open source implementations of these types of LSTM/CNN/CRF models for NER and CER.
After train_models.sh and run_models.sh have been run, or the individual models above have been trained and run, the run_tool_evaluation.py file can be used to run the evaluation. This will use the annotations from all tools to calculate F1-score, recall, and precision. Including the -b option will run bootstrap to compute standard errors. Including the -l option will evaluate the annotations using the Levenshtein metric, for inexact matching.
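For illustration, here is a minimal sketch of the relaxed matching idea using the leven package from the requirements; the case folding and the 0.2 threshold are assumptions made for this example, not necessarily the criteria used by run_tool_evaluation.py:

```python
from leven import levenshtein

def relaxed_match(predicted, gold, threshold=0.2):
    """Treat two mentions as matching if their Levenshtein distance,
    normalized by the longer string's length, falls below the threshold.
    The threshold and lowercasing here are illustrative only."""
    if not predicted or not gold:
        return False
    dist = levenshtein(predicted.lower(), gold.lower())
    return dist / max(len(predicted), len(gold)) < threshold

print(relaxed_match("acetylsalicylic acid", "acetylsalicylic acids"))  # True
print(relaxed_match("aspirin", "ibuprofen"))                           # False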
The results for each model can be viewed in the results_printouts directory. The results will be saved to one of four files, depending on the CLI options used:
- results_tool_evaluation.txt for results calculated using exact matching
- results_tool_evaluation_bootstrap.txt for results calculated using exact matching and bootstrap to generate the standard error
- results_tool_evaluation_leven.txt for results calculated using relaxed matching criteria (Levenshtein distance normalized by string length)
- results_tool_evaluation_leven_bootstrap.txt for standard errors of relaxed matching results
Additionally, annotation sets for each tool can be found in the data/annotation_sets directory. If the -l option has been used, Levenshtein measurements for each tool for each entity can be found in result_printouts/levenshtein_measurements.txt
https://github.com/google-research/bert
https://github.com/zihangdai/xlnet
https://github.com/stevezheng23/xlnet_extension_tf
https://github.com/kyzhouhzau/BERT-NER