Commit 92079d1

readme
fgh95 committed Jan 12, 2021
1 parent aa262b5 commit 92079d1
Showing 11 changed files with 65 additions and 70 deletions.
123 changes: 60 additions & 63 deletions README.md
@@ -1,5 +1,6 @@
# PKDocClassifier
-[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgh95/PKDocClassifier/blob/master/LICENSE)
+[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgh95/PKDocClassifier/blob/master/LICENSE) ![version](https://img.shields.io/badge/version-0.0.1-blue) [![Website shields.io](https://img.shields.io/website-up-down-green-red/http/shields.io.svg)](https://app.pkpdai.com/)


[**PKDocClassifier**](#pkdocclassifier) | [**Reproduce our results**](#reproduce-our-results) | [**Make new predictions**](#make-new-predictions) | [**Citing**](#citation)

@@ -13,38 +14,34 @@
This repository contains custom pipes and models to classify scientific publications …

You will need an environment with **Python 3.7 or greater**. We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install the packages related to this project. Our default option will be to create a virtual environment with conda:

-1. If you don't have conda follow the instructions [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html?highlight=conda#regular-installation)
+1. If you don't have anaconda installed, follow the instructions [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html?highlight=conda#regular-installation)

-2. Run
+2. Create a conda environment for this project and activate it:

````
conda create -n PKDocClassifier python=3.7
+conda activate PKDocClassifier
````
-3. Activate it through
-````
-source activate PKDocClassifier
-````
-Then, clone and access this repository on your local machine through:
+3. Clone and access this repository on your local machine:
````
git clone https://github.com/fgh95/PKDocClassifier
cd PKDocClassifier
````
-**If you are on MacOSX install LLVM's OpenMP runtime library, e.g.**
+If you are on MacOSX run:
````
brew install libomp
````
-Install all dependencies by running:
+4. Install all project dependencies:
````
pip install .
````
-## 2. Data download - Optional
+## 2. Data download and parsing - Optional
If you would like to reproduce the steps taken for data retrieval and parsing, you will need to download the whole MEDLINE dataset and store it as a Spark DataFrame.
However, you can also skip this step and use the parsed data available at [data/subsets/](https://github.com/fgh95/PKDocClassifier/tree/master/data/subsets). Otherwise, follow the steps at the [pubmed_parser wiki](https://github.com/titipata/pubmed_parser/wiki/Download-and-preprocess-MEDLINE-dataset) and place the resulting `medline_lastview.parquet` file at _data/medline_lastview.parquet_. Then, change the [spark config file](https://github.com/fgh95/PKDocClassifier/blob/master/sparksetup/sparkconf.py) to match your spark configuration and run:
@@ -55,49 +52,49 @@
````
python getready.py
````
This should generate the files at [data/subsets/](https://github.com/fgh95/PKDocClassifier/tree/master/data/subsets).
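Before running `getready.py`, you can sanity-check the parquet dump directly with PySpark. The snippet below is a minimal sketch rather than part of the repository; it assumes the file location above and the Spark home used in [sparksetup/sparkconf.py](https://github.com/fgh95/PKDocClassifier/blob/master/sparksetup/sparkconf.py):
````
import findspark
findspark.init(spark_home="/opt/spark-2.4.4/")  # Spark home from sparksetup/sparkconf.py

from pyspark.sql import SparkSession

# Local session mirroring the repository's SparkConf; the app name is an assumption
spark = SparkSession.builder.appName("medline-check").master("local[*]").getOrCreate()

# Load the MEDLINE dump parsed with pubmed_parser and inspect it
medline = spark.read.parquet("data/medline_lastview.parquet")
medline.printSchema()                # parsed fields available for feature extraction
print(medline.count(), "documents")  # the full MEDLINE dump runs to tens of millions of rows
spark.stop()
````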
-## 3. Run
+## 3. Reproduce results

### 3.1. Field analysis and N-grams

-3.1.1. To generate the features run (~30min):
+1. To generate the features run (~30min):
````
python features_bow.py
````
-3.1.2. Bootstrap field analysis (~3h on 12 threads, requires at least 16GB of RAM)
+2. Bootstrap field analysis (~3h on 12 threads, requires at least 16GB of RAM):
````
python bootstrap_bow.py \
    --input-dir data/encoded/fields \
    --output-dir data/results/fields \
    --output-dir-bootstrap data/results/fields/bootstrap \
    --path-labels data/labels/dev_data.csv
````
-3.1.3. Bootstrap n-grams (~3h on 12 threads, requires at least 16GB of RAM)
+3. Bootstrap n-grams (~3h on 12 threads, requires at least 16GB of RAM):
````
python bootstrap_bow.py \
    --input-dir data/encoded/ngrams \
    --output-dir data/results/ngrams \
    --output-dir-bootstrap data/results/ngrams/bootstrap \
    --path-labels data/labels/dev_data.csv
````
-3.1.4. Display results
+4. Display results:
````
python display_results.py \
    --input-dir data/results/fields \
    --output-dir data/final/fields
````
````
python display_results.py \
    --input-dir data/results/ngrams \
    --output-dir data/final/ngrams
````
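For reference, the flags used in the commands above correspond to a plain argparse interface in `scripts/bootstrap_bow.py`. The sketch below is reconstructed from the flag names in this README; the description, help texts, and `required` settings are assumptions:
````
import argparse

parser = argparse.ArgumentParser(description="Bootstrap evaluation of bag-of-words features")
parser.add_argument("--input-dir", required=True,
                    help="directory with encoded features, e.g. data/encoded/fields")
parser.add_argument("--output-dir", required=True,
                    help="directory for aggregated results")
parser.add_argument("--output-dir-bootstrap", required=True,
                    help="directory for per-iteration bootstrap outputs")
parser.add_argument("--path-labels", required=True,
                    help="CSV with document labels, e.g. data/labels/dev_data.csv")
args = parser.parse_args()
````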
### 3.2. Distributed representations
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file removed results/__init__.py
2 changes: 1 addition & 1 deletion scripts/bootstrap_bow.py
@@ -8,7 +8,7 @@
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm import tqdm
-from encoders.bootstrap import Tokenizer, TextSelector
+from pk_classifier.bootstrap import Tokenizer, TextSelector
import argparse


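The import change above moves the shared pipeline components into the `pk_classifier` package. For context, a `TextSelector` in sklearn text pipelines is commonly a small transformer that picks one text column out of a DataFrame; this is a generic sketch of that pattern, not the repository's implementation:
````
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame inside a Pipeline."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.field]
````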
4 changes: 2 additions & 2 deletions scripts/bootstrap_dist.py
@@ -8,10 +8,10 @@
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm import tqdm
-from encoders.bootstrap import Tokenizer, TextSelector
+from pk_classifier.bootstrap import Tokenizer, TextSelector
import argparse
import warnings
-from encoders.utils import read_jsonl
+from pk_classifier.utils import read_jsonl


def f1_eval(y_pred, dtrain):
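The body of `f1_eval` is collapsed in this view. A custom evaluation function for `xgboost.train` conventionally receives the raw predictions and a `DMatrix` and returns a `(name, value)` pair; the sketch below shows that typical shape and is not necessarily the repository's exact code:
````
import numpy as np
from sklearn.metrics import f1_score

def f1_eval(y_pred, dtrain):
    # xgboost minimizes custom metrics, so report 1 - F1 as an error
    y_true = dtrain.get_label()
    return 'f1_err', 1.0 - f1_score(y_true, np.round(y_pred))
````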
2 changes: 1 addition & 1 deletion scripts/display_results.py
@@ -1,4 +1,4 @@
-from results.stats import get_all_results
+from pk_classifier.stats import get_all_results
import os
import argparse

3 changes: 1 addition & 2 deletions scripts/features_dist.py
@@ -1,9 +1,8 @@
from features_bow import run as run_bow
import os
import pandas as pd
from tqdm import tqdm
from sklearn.pipeline import Pipeline, FeatureUnion
-from encoders.utils import make_pipeline, Embedder, ConcatenizerEmb
+from pk_classifier.utils import make_pipeline, Embedder, ConcatenizerEmb


def pre_process(inp_path, out_path, field_list, ngrams, maxmin):
1 change: 0 additions & 1 deletion sparksetup/sparkconf.py
@@ -6,7 +6,6 @@
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
findspark.init(spark_home="/opt/spark-2.4.4/")
-

conf = SparkConf(). \
setAppName('main'). \
setMaster('local[*]'). \
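The `conf` chain is truncated in this view; a typical completion builds a SparkContext from it, as in the sketch below. The extra option shown is an illustrative assumption, not the file's actual settings:
````
from pyspark import SparkConf, SparkContext

conf = SparkConf(). \
    setAppName('main'). \
    setMaster('local[*]'). \
    set('spark.driver.memory', '16g')  # assumed value; tune to your machine

sc = SparkContext(conf=conf)
````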
