PKDocClassifier | Data | Reproduce our results | Use our model | Citing
This repository contains custom pipes and models to classify scientific publications from PubMed depending on whether they estimate pharmacokinetic (PK) parameters from in vivo studies. The final pipeline retrieved more than 121K PK publications and runs weekly updates, available at https://app.pkpdai.com/.
The labels assigned to each publication in the training and test sets are available in CSV format in the labels folder. We also provide the textual fields of each publication, after parsing, in the subsets folder.
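For a quick look at the data, the labels can be loaded with pandas; we only print the shape and first rows here, since the exact column names should be checked against the CSV itself:

import pandas as pd

# Load the development-set labels; inspect df.columns for the actual schema,
# since no column names are assumed here.
df = pd.read_csv("data/labels/dev_data.csv")
print(df.shape)
print(df.head())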
You will need an environment with Python 3.7+. We strongly recommend using an isolated Python environment (such as virtualenv or conda) to install the packages for this project. Our default option is to create a virtual environment with conda:
- If you don't have Anaconda installed, follow the instructions here.
- Create a conda environment for this project and activate it:

conda create -n PKDocClassifier python=3.7
conda activate PKDocClassifier
- Clone this repository and move into it on your local machine:

git clone https://github.com/fgh95/PKDocClassifier
cd PKDocClassifier

If you are on macOS, install LLVM's OpenMP runtime library first, e.g.:

brew install libomp
- Install all project dependencies:

pip install .
If you would like to reproduce the steps taken for data retrieval and parsing, you will need to download the whole MEDLINE dataset and store it in a Spark DataFrame. You can skip this step and use the parsed data available at data/subsets/. Otherwise, follow the steps in the pubmed_parser wiki and place the resulting medline_lastview.parquet file at data/medline_lastview.parquet. Then, edit the spark config file to match your Spark configuration and run:
python getready.py
This should generate the files at data/subsets/.
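To sanity-check the output, the generated subsets can be inspected with pandas (reading parquet requires pyarrow or fastparquet); the file names below are the ones referenced later in this README:

import pandas as pd

# Inspect the parsed subsets produced by getready.py; reading parquet
# requires pyarrow or fastparquet to be installed.
dev = pd.read_parquet("data/subsets/dev_subset.parquet")
test = pd.read_parquet("data/subsets/test_subset.parquet")
print(len(dev), len(test))
print(dev.columns.tolist())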
- To generate the features, run (~30 min):
python scripts/features_bow.py
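For intuition about what this encoding step produces, here is a minimal bag-of-words sketch with scikit-learn; it only illustrates the idea of unigram/bigram count features and is not the repository's implementation:

from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for parsed PubMed fields.
docs = [
    "clearance was estimated after oral administration",
    "in vitro assay of enzyme inhibition",
]

# Count features over unigrams and bigrams, as in the n-gram experiments.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)
print(features.shape)
print(vectorizer.get_feature_names_out()[:5])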
- Bootstrap the field analysis (~3 h on 12 threads; requires at least 16 GB of RAM; set --overwrite to False if you want to skip this step):
python scripts/bootstrap_bow.py \
    --input-dir data/encoded/fields \
    --output-dir data/results/fields \
    --output-dir-bootstrap data/results/fields/bootstrap \
    --path-labels data/labels/dev_data.csv \
    --overwrite True
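The bootstrap scripts resample the development set to quantify the uncertainty around each metric. The following standalone sketch shows the general idea with toy data; it is not the code in bootstrap_bow.py:

import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)  # toy labels
y_pred = rng.integers(0, 2, size=200)  # toy predictions

# Resample label/prediction pairs with replacement and recompute F1 each time.
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    scores.append(f1_score(y_true[idx], y_pred[idx]))

# 95% percentile interval over the bootstrap replicates.
print(np.percentile(scores, [2.5, 97.5]))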
- Bootstrap the n-gram analysis (set --overwrite to False if you want to skip this step):
python scripts/bootstrap_bow.py \
    --input-dir data/encoded/ngrams \
    --output-dir data/results/ngrams \
    --output-dir-bootstrap data/results/ngrams/bootstrap \
    --path-labels data/labels/dev_data.csv \
    --overwrite True
- Display the results:
python scripts/display_results.py \
    --input-dir data/results/fields \
    --output-dir data/final/fields

python scripts/display_results.py \
    --input-dir data/results/ngrams \
    --output-dir data/final/ngrams
- Encode using SPECTER. To generate the features with SPECTER, first preprocess the data by running:
python preprocess_specter.py
This will generate the input data as .ids and .json files at data/encoded/specter/.
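If SPECTER rejects the inputs, it helps to know the expected shapes: the .ids file lists one paper ID per line, and the metadata .json maps each ID to its textual fields (this matches the format described in the SPECTER README, but double-check against your SPECTER version):

import json

# One paper ID per line in the .ids file.
with open("data/encoded/specter/dev_ids.ids") as f:
    ids = [line.strip() for line in f]

# The metadata .json maps each ID to its title/abstract fields
# (per the SPECTER README; verify against your SPECTER version).
with open("data/encoded/specter/dev_meta.json") as f:
    meta = json.load(f)

print(len(ids), "ids; first record:", meta[ids[0]])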
Finally, to generate the input features you will need to clone the SPECTER repo and follow the instructions on how to use the pretrained model. After cloning SPECTER and installing its dependencies, we ran the following commands from the specter directory to encode the documents:

python scripts/embed.py \
    --ids ../data/encoded/specter/dev_ids.ids \
    --metadata ../data/encoded/specter/dev_meta.json \
    --model ./model.tar.gz \
    --output-file ../data/encoded/specter/dev_specter.jsonl \
    --vocab-dir data/vocab/ \
    --batch-size 16 \
    --cuda-device -1
python scripts/embed.py \
    --ids ../data/encoded/specter/test_ids.ids \
    --metadata ../data/encoded/specter/test_meta.json \
    --model ./model.tar.gz \
    --output-file ../data/encoded/specter/test_specter.jsonl \
    --vocab-dir data/vocab/ \
    --batch-size 16 \
    --cuda-device -1
This should output two files: data/encoded/specter/dev_specter.jsonl and data/encoded/specter/test_specter.jsonl.
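Each line of these .jsonl files is one JSON record per encoded document, so they can be loaded as follows; field names may vary across SPECTER versions, so inspect the keys of one record first:

import json

# Each line of the .jsonl output is one JSON record per document.
with open("data/encoded/specter/dev_specter.jsonl") as f:
    records = [json.loads(line) for line in f]

# Field names may differ across SPECTER versions; check the keys first.
print(records[0].keys())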
- Generate BioBERT representations:
python scripts/features_dist.py
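As a rough illustration of what an averaged BioBERT representation looks like, here is a minimal mean-pooling sketch with Hugging Face transformers; the checkpoint and pooling details in features_dist.py may differ:

import torch
from transformers import AutoModel, AutoTokenizer

# A publicly available BioBERT checkpoint; the repo's exact version may differ.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

text = "Clearance and volume of distribution were estimated in healthy volunteers."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Average the token embeddings into a single document vector.
doc_vector = hidden.mean(dim=1).squeeze(0)
print(doc_vector.shape)  # torch.Size([768])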
- Run bootstrap iterations for the distributed representations (set --overwrite to False if you want to skip this step):
python scripts/bootstrap_dist.py \
    --is-specter True \
    --use-bow False \
    --input-dir data/encoded/specter \
    --output-dir data/results/distributional \
    --output-dir-bootstrap data/results/distributional/bootstrap \
    --path-labels data/labels/dev_data.csv \
    --path-optimal-bow data/encoded/ngrams/dev_unigrams.parquet \
    --overwrite True
python scripts/bootstrap_dist.py \
    --is-specter False \
    --use-bow False \
    --input-dir data/encoded/biobert \
    --output-dir data/results/distributional \
    --output-dir-bootstrap data/results/distributional/bootstrap \
    --path-labels data/labels/dev_data.csv \
    --path-optimal-bow data/encoded/ngrams/dev_unigrams.parquet \
    --overwrite True
- Display the results:
python scripts/display_results.py \
    --input-dir data/results/distributional \
    --output-dir data/final/distributional \
    --convert-latex

python scripts/display_results.py \
    --input-dir data/results/distributional/bow_and_distributional \
    --output-dir data/final/distributional/bow_and_distributional \
    --convert-latex
From these plots we can see that, on average, the best-performing architecture on the training data is the one using averaged BioBERT embeddings together with unigram features.
Run the cross-validation analyses:
python scripts/cross_validate.py \
--training-embeddings data/encoded/biobert/dev_biobert_avg.parquet \
--training-optimal-bow data/encoded/ngrams/dev_unigrams.parquet \
--training-labels data/labels/dev_data.csv \
--output-dir data/results/final-pipeline
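For reference, the kind of k-fold evaluation this script runs can be sketched as follows; the arrays and classifier below are toy stand-ins, not the pipeline's actual features or model:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the concatenated BioBERT + unigram features and labels.
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size=100)

# 5-fold cross-validation with F1 as the scoring metric.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())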
Train the final pipeline (preprocessing, encoding, decoding) from scratch with optimal hyperparameters and apply it to the test set:
python scripts/train_test_final.py \
--path-train data/subsets/dev_subset.parquet \
--train-labels data/labels/dev_data.csv \
--path-test data/subsets/test_subset.parquet \
--test-labels data/labels/test_data.csv \
--cv-dir data/results/final-pipeline \
--output-dir data/results/final-pipeline \
--train-pipeline True
The final results on the test set should be printed to the terminal.
See a toy example of how to make new predictions here.
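As a minimal sketch of what a new prediction could look like with a serialized scikit-learn-style pipeline, assuming the file path and input format below (both are hypothetical; the linked toy example shows the actual API):

import joblib

# Hypothetical path: adjust to wherever the trained pipeline was saved.
pipeline = joblib.load("data/results/final-pipeline/pipeline.joblib")

# A new publication title to classify; the expected input format may differ.
new_doc = ["Population pharmacokinetics of rifampicin in patients with tuberculosis"]
print(pipeline.predict(new_doc))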