chopeen/CORD-19

COVID-19 Open Research Dataset Challenge (CORD-19)

Kaggle challenge

Challenge: CORD-19

Task: What do we know about COVID-19 risk factors?

Submission: Kaggle notebook

Team

  • Adrianna Safaryn
  • Anna Haratym-Rojek
  • Cezary Szulc
  • Marek Grzenkowicz

Goal

We wanted to use named entity recognition (NER) to highlight names of risk factors (RF). Our goal was to train a custom NER model for spaCy that could later be used to recognize risk factors in medical publications.

Figure: RF tags in Prodigy

Pipeline

  1. Data preprocessing to extract 'risk factor(s)' sentences
  2. Manual data annotation in Prodigy
  3. Pretraining different models - we experimented with different base models and trained a number of tok2vec layers to maximize the F-score
    • base models: en_vectors_web_lg, en_core_web_lg, en_core_sci_lg
    • tok2vec layers were trained for: RF sentences, subset of abstracts, all abstracts
  4. Labelling more data by correcting the predictions of the top model trained in the previous step
  5. Go back to step #3 to pretrain a new model using more data and then label even more data
  6. Training the final model with all gathered annotations
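
To illustrate how annotations from steps 2 and 4 could feed the training in steps 3 and 6, here is a minimal sketch that converts Prodigy-style JSONL records into spaCy v2-style training tuples. It is an assumption-laden illustration, not the project's actual code: the function name `prodigy_to_spacy` is hypothetical, the label `RF` is inferred from the "RF tags" mentioned above, and the project may have relied on Prodigy's own training recipes instead.

```python
import json


def prodigy_to_spacy(jsonl_text, label="RF"):
    """Convert accepted Prodigy records into (text, {"entities": [...]}) tuples.

    Hypothetical helper: the "RF" label name is an assumption.
    """
    examples = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only examples the annotator accepted; skip rejected/ignored ones.
        if record.get("answer") != "accept":
            continue
        # Each span carries character offsets into the record's text.
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
            if span["label"] == label
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples
```

Rejected and ignored records are dropped here for simplicity; a real setup might keep rejected examples as negative evidence.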

Model performance

Each iteration uses all datasets from the previous one and adds more annotations. For detailed information about every trained model, see the notebook train_experiments_2.ipynb.

Base model en_core_sci_lg

Iteration  Datasets (data/annotated/)              Best F-score
1          cord_19_rf_sentences                    53.333
2          above + cord_19_rf_sentences_correct    75.630
3          above + cord_19_rf_sentences_correct_2  74.894
4          above + cord_19_rf_sentences_correct_3  68.770

Base model en_core_sci_md

Iteration  Datasets (data/annotated/)              Best F-score  Download
1          cord_19_rf_sentences                    57.778        en_ner_rf_i1_md
2          above + cord_19_rf_sentences_correct    74.380        en_ner_rf_i2_md
3          above + cord_19_rf_sentences_correct_2  74.236        en_ner_rf_i3_md
4          above + cord_19_rf_sentences_correct_3  69.725        en_ner_rf_i4_md

Using a smaller base model (md instead of lg) results in a significantly smaller model. The effect on the F-score is mixed: the md model scores higher in iterations 1 and 4 and slightly lower in iterations 2 and 3.
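
For readers unfamiliar with the metric, the following sketch shows how an entity-level F-score on a 0-100 scale can be computed. It assumes the scores above follow the common NER convention of counting a prediction as correct only when its span boundaries and label match the gold annotation exactly; the function name `entity_f_score` is hypothetical and this is not the evaluation code used in the project.

```python
def entity_f_score(gold, predicted):
    """Return (precision, recall, F-score) on a 0-100 scale for exact span matches.

    Both arguments are iterables of (start, end, label) tuples. To score a
    whole corpus, include a document id in each tuple so spans stay unique.
    """
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts only if start, end and label all match a gold span.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score
```

Under this convention a prediction that overlaps a gold span but differs by even one character counts as both a false positive and a false negative.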

Packaged models

Medium models for iterations 1..4 can be installed using the download links from the table above.

The directory test/ contains a demo of the models in action (separate Conda environment + notebook).

Key files and resources

Challenges

Data preprocessing

+------------------+           +------------------+           +------------------+
|                  |           |                  |           |                  |          +----------------------------+
|                  |  FILTER   | Abstracts and    |  FILTER   | Sentences that   |          |                            |
| CORD-19 dataset  +---------->| articles that    +---------->| contain phrase   +--------> | cord_19_rf_sentences.jsonl |
|                  |           | mention COVID-19 |           | "risk factor(s)" |          |                            |
|                  |           | and synonyms     |           |                  |          +----------------------------+
+------------------+           +------------------+           +------------------+
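
The two filtering stages in the diagram can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's preprocessing code: the COVID-19 synonym list, the naive regex sentence splitter, the function names, and the `{"text": ...}` record shape are all hypothetical choices made for the example.

```python
import json
import re

# Assumed keyword list; the diagram only says "COVID-19 and synonyms".
COVID_TERMS = re.compile(
    r"covid-19|sars-cov-2|2019-ncov|novel coronavirus", re.IGNORECASE
)
# Matches both "risk factor" and "risk factors".
RISK_FACTOR = re.compile(r"\brisk factors?\b", re.IGNORECASE)
# Naive sentence splitter: break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_rf_sentences(documents):
    """Yield risk-factor sentences from documents that mention COVID-19."""
    for text in documents:
        # First filter: keep only documents mentioning COVID-19 or a synonym.
        if not COVID_TERMS.search(text):
            continue
        # Second filter: keep only sentences containing "risk factor(s)".
        for sentence in SENTENCE_END.split(text):
            sentence = sentence.strip()
            if sentence and RISK_FACTOR.search(sentence):
                yield sentence


def to_jsonl(sentences):
    """Serialize sentences as JSONL, one {"text": ...} record per line."""
    return "\n".join(json.dumps({"text": s}) for s in sentences)
```

A regex splitter will mis-split abbreviations such as "et al."; a proper sentence segmenter (for example the one in a spaCy pipeline) would be more robust on biomedical text.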

Tools

  • Prodigy - text annotation
  • spaCy - NLP and model training
  • scispaCy - specialized spaCy models for biomedical text processing
  • Miniconda - environment setup (you can use conda env create -f environment.yml to set up the Python environment with all packages and models)

Dataset citation

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2020-03-13.
Retrieved from https://pages.semanticscholar.org/coronavirus-research.
Accessed 2020-03-26. doi:10.5281/zenodo.3715506

Notes

  1. Training a NER model with Prodigy and Transfer Learning
  2. When to reject when annotating text for NER?
  3. When should I press accept, reject or ignore?
  4. batch-train is deprecated
