COVID-19 Open Research Dataset Challenge (CORD-19)

Kaggle challenge

Challenge: CORD-19

Task: What do we know about COVID-19 risk factors?

Team

Adrianna Safaryn
Anna Haratym-Rojek
Cezary Szulc
Marek Grzenkowicz

Goal

We wanted to use named entity recognition (NER) to highlight names of risk factors (RF). Our goal was to train a custom NER model for spaCy that could later be used to recognize risk factors in medical publications.
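To make "highlighting" concrete, here is a minimal sketch of how recognized entity spans could be rendered in text. It uses only the standard library. The function name `highlight`, the label `RF`, and the sample sentence are hypothetical and chosen for illustration. With a trained spaCy model, the spans would typically come from `doc.ents` (`ent.start_char`, `ent.end_char`, `ent.label_`).

```python
from typing import List, Tuple


def highlight(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span in [brackets] followed by its label.

    Assumes the spans do not overlap.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        parts.append(text[cursor:start])               # text before the entity
        parts.append(f"[{text[start:end]}]({label})")  # the marked entity
        cursor = end
    parts.append(text[cursor:])                        # text after the last entity
    return "".join(parts)


# Hypothetical sentence and character offsets, not taken from the dataset.
text = "Hypertension and obesity were reported as risk factors."
spans = [(0, 12, "RF"), (17, 24, "RF")]
highlighted = highlight(text, spans)
print(highlighted)
```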

Pipeline

1. Data preprocessing to extract 'risk factor(s)' sentences
2. Manual data annotation in Prodigy
3. Pretraining different models - we experimented with different base models and trained a number of tok2vec layers to maximize the F-score
   - base models: en_vectors_web_lg, en_core_web_lg, en_core_sci_lg
   - tok2vec layers were trained for: RF sentences, subset of abstracts, all abstracts
4. Labelling more data by correcting the predictions of the top model trained in the previous step
5. Go back to step #3 to pretrain a new model using more data and then label even more data
6. Training the final model with all gathered annotations

Model performance

Each iteration uses all datasets from the previous one and adds more annotations. For detailed information about every trained model, see the notebook train_experiments_2.ipynb.

Base model `en_core_sci_lg`

Iteration	Datasets (data/annotated/)	Best F-score
1	`cord_19_rf_sentences`	53.333
2	above + `cord_19_rf_sentences_correct`	75.630
3	above + `cord_19_rf_sentences_correct_2`	74.894
4	above + `cord_19_rf_sentences_correct_3`	68.770

Base model `en_core_sci_md`

Iteration	Datasets (data/annotated/)	Best F-score	Download
1	`cord_19_rf_sentences`	57.778	en_ner_rf_i1_md
2	above + `cord_19_rf_sentences_correct`	74.380	en_ner_rf_i2_md
3	above + `cord_19_rf_sentences_correct_2`	74.236	en_ner_rf_i3_md
4	above + `cord_19_rf_sentences_correct_3`	69.725	en_ner_rf_i4_md

Using a smaller base model (md instead of lg) results in a significantly smaller model, while the F-score moves in both directions depending on the iteration: md scores higher in iterations 1 and 4, lg in iterations 2 and 3.
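For readers unfamiliar with the metric, the following is a minimal sketch of how an entity-level F-score can be computed. It assumes the scores in the tables are exact-match entity F-scores on a 0-100 scale, which is how spaCy reports them. The function name `prf`, the label `RF`, and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) as percentages.

    A prediction counts as correct only if its start, end, and label
    all match a gold span exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score


# Hypothetical example: three gold entities, three predictions, two exact matches.
gold = {(0, 8, "RF"), (13, 25, "RF"), (40, 47, "RF")}
predicted = {(0, 8, "RF"), (13, 25, "RF"), (30, 36, "RF")}
precision, recall, f_score = prf(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

Because only exact matches count, a prediction that covers part of a compound phrase is scored as both a false positive and a false negative.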

Packaged models

Medium models for iterations 1..4 can be installed using the download links from the table above.

The directory test/ contains a demo of the models in action (separate Conda environment + notebook).

Key files and resources

Training of tok2vec layers: Kaggle notebook
Full set of annotations:
- cord_19_rf_sentences_merged.jsonl (dump of the Prodigy dataset)
- cord_19_rf_sentences_merged.json (spaCy JSON format)
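As a rough illustration of working with the Prodigy dump, the sketch below reads JSONL records and converts accepted ones into simple `(text, {"entities": [...]})` tuples. It assumes the usual Prodigy NER record layout (`text`, `spans` with `start`/`end`/`label`, and `answer`). The function name `prodigy_to_spacy`, the label `RF`, and the sample records are hypothetical. Note that this tuple form is not the same as the spaCy JSON format stored in the `.json` file.

```python
import json
from typing import Dict, Iterable, List, Tuple


def prodigy_to_spacy(lines: Iterable[str]) -> List[Tuple[str, Dict[str, list]]]:
    """Convert Prodigy JSONL lines into (text, {"entities": [...]}) tuples."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only accepted annotations; records without "answer" are kept.
        if record.get("answer", "accept") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hypothetical records, not taken from the annotated dataset.
sample = [
    '{"text": "Diabetes is a risk factor.", '
    '"spans": [{"start": 0, "end": 8, "label": "RF"}], "answer": "accept"}',
    '{"text": "Skipped example.", "spans": [], "answer": "reject"}',
]
examples = prodigy_to_spacy(sample)
print(examples)
```

To process the real file, the same function could be called with an open file handle, for example `prodigy_to_spacy(open(path, encoding="utf-8"))`, where `path` points to the JSONL dump.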
Log of all experiments (including data annotation and model training): train_experiments_2.ipynb
Early experiments: train_experiments_1.ipynb

Challenges

Detailed discussion posted at the Kaggle forum: Custom NER model to recognize risk factor names
Question posted to the Prodigy support forum: Annotating compound entity phrases

Data preprocessing

+------------------+           +------------------+           +------------------+
|                  |           |                  |           |                  |          +----------------------------+
|                  |  FILTER   | Abstracts and    |  FILTER   | Sentences that   |          |                            |
| CORD-19 dataset  +---------->| articles that    +---------->| contain phrase   +--------> | cord_19_rf_sentences.jsonl |
|                  |           | mention COVID-19 |           | "risk factor(s)" |          |                            |
|                  |           | and synonyms     |           |                  |          +----------------------------+
+------------------+           +------------------+           +------------------+
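The two filtering stages in the diagram can be sketched as follows. This is a simplified illustration, not the project's actual preprocessing code: the synonym list, the naive regex sentence splitter, and the names `COVID_PATTERN`, `RF_PATTERN`, and `rf_sentences` are all assumptions.

```python
import json
import re
from typing import Iterable, Iterator

# Assumed synonym list; the terms actually used for filtering may differ.
COVID_PATTERN = re.compile(
    r"covid-19|sars-cov-2|2019-ncov|novel coronavirus", re.IGNORECASE
)
# Matches "risk factor" and "risk factors".
RF_PATTERN = re.compile(r"\brisk factors?\b", re.IGNORECASE)
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def rf_sentences(documents: Iterable[str]) -> Iterator[str]:
    """Yield 'risk factor(s)' sentences from documents that mention COVID-19."""
    for text in documents:
        if not COVID_PATTERN.search(text):
            continue  # first filter: document must mention COVID-19 or a synonym
        for sentence in SENTENCE_SPLIT.split(text):
            if RF_PATTERN.search(sentence):  # second filter: sentence level
                yield sentence.strip()


# Hypothetical documents, not taken from CORD-19.
docs = [
    "COVID-19 spreads quickly. Age is a known risk factor for severe disease.",
    "Influenza vaccines are updated yearly. Smoking is a risk factor.",
]
sentences = list(rf_sentences(docs))

# One JSON object per line, with the sentence stored under "text".
jsonl = "\n".join(json.dumps({"text": sentence}) for sentence in sentences)
print(jsonl)
```

The second document is dropped entirely because it never mentions COVID-19, even though it contains a matching sentence.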

Tools

Prodigy - text annotation
spaCy - NLP and model training
scispaCy - specialized spaCy models for biomedical text processing
Miniconda - environment setup (you can use conda env create -f environment.yml to set up the Python environment with all packages and models)

Dataset citation

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13.
Retrieved from https://pages.semanticscholar.org/coronavirus-research.
Accessed 2020-03-26. doi:10.5281/zenodo.3715506
