Commit 92079d1

readme
fgh95 committed Jan 12, 2021
1 parent aa262b5 commit 92079d1
Showing 11 changed files with 65 additions and 70 deletions.
123 changes: 60 additions & 63 deletions README.md
@@ -1,5 +1,6 @@
# PKDocClassifier
-[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgh95/PKDocClassifier/blob/master/LICENSE)
+[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/fgh95/PKDocClassifier/blob/master/LICENSE) ![version](https://img.shields.io/badge/version-0.0.1-blue) [![Website shields.io](https://img.shields.io/website-up-down-green-red/http/shields.io.svg)](https://app.pkpdai.com/)


[**PKDocClassifier**](#pkdocclassifier) | [**Reproduce our results**](#reproduce-our-results) | [**Make new predictions**](#make-new-predictions) | [**Citing**](#citation)

@@ -13,38 +14,34 @@
This repository contains custom pipes and models to classify scientific publications …

You will need an environment with **Python 3.7 or greater**. We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install the packages related to this project. Our default option will be to create a virtual environment with conda:

-1. If you don't have conda follow the instructions [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html?highlight=conda#regular-installation)
+1. If you don't have anaconda installed, follow the instructions [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html?highlight=conda#regular-installation)

-2. Run
+2. Create a conda environment for this project and activate it:

````
conda create -n PKDocClassifier python=3.7
+conda activate PKDocClassifier
````
-3. Activate it through
-````
-source activate PKDocClassifier
-````
-Then, clone and access this repository on your local machine through:
+3. Clone and access this repository on your local machine:
````
git clone https://github.com/fgh95/PKDocClassifier
cd PKDocClassifier
````
-**If you are on MacOSX install LLVM's OpenMP runtime library, e.g.**
+If you are on MacOSX run:
````
brew install libomp
````
-Install all dependencies by running:
+4. Install all project dependencies:
````
pip install .
````
-## 2. Data download - Optional
+## 2. Data download and parsing - Optional
If you would like to reproduce the steps taken for data retrieval and parsing, you will need to download the whole MEDLINE dataset and store it as a Spark DataFrame.
However, you can also skip this step and use the parsed data available at [data/subsets/](https://github.com/fgh95/PKDocClassifier/tree/master/data/subsets). Otherwise, follow the steps at the [pubmed_parser wiki](https://github.com/titipata/pubmed_parser/wiki/Download-and-preprocess-MEDLINE-dataset) and place the resulting `medline_lastview.parquet` file at _data/medline_lastview.parquet_. Then, change the [spark config file](https://github.com/fgh95/PKDocClassifier/blob/master/sparksetup/sparkconf.py) to match your spark configuration and run:
@@ -55,49 +52,49 @@
````
python getready.py
````
This should generate the files at [data/subsets/](https://github.com/fgh95/PKDocClassifier/tree/master/data/subsets).
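Before running `getready.py`, you can sanity-check the parquet dump directly with PySpark. The snippet below is a minimal sketch rather than part of the repository; it assumes the file location above and the Spark home used in [sparksetup/sparkconf.py](https://github.com/fgh95/PKDocClassifier/blob/master/sparksetup/sparkconf.py):
````
import findspark
findspark.init(spark_home="/opt/spark-2.4.4/")  # Spark home from sparksetup/sparkconf.py

from pyspark.sql import SparkSession

# Local session mirroring the repository's SparkConf; the app name is an assumption
spark = SparkSession.builder.appName("medline-check").master("local[*]").getOrCreate()

# Load the MEDLINE dump parsed with pubmed_parser and inspect it
medline = spark.read.parquet("data/medline_lastview.parquet")
medline.printSchema()                # parsed fields available for feature extraction
print(medline.count(), "documents")  # the full MEDLINE dump runs to tens of millions of rows
spark.stop()
````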
-## 3. Run
+## 3. Reproduce results

### 3.1. Field analysis and N-grams

-3.1.1. To generate the features run (~30min):
+1. To generate the features run (~30min):
````
python features_bow.py
````
-3.1.2. Bootstrap field analysis (~3h on 12 threads, requires at least 16GB of RAM)
+2. Bootstrap field analysis (~3h on 12 threads, requires at least 16GB of RAM):
````
python bootstrap_bow.py \
    --input-dir data/encoded/fields \
    --output-dir data/results/fields \
    --output-dir-bootstrap data/results/fields/bootstrap \
    --path-labels data/labels/dev_data.csv
````
-3.1.3. Bootstrap n-grams (~3h on 12 threads, requires at least 16GB of RAM)
+3. Bootstrap n-grams (~3h on 12 threads, requires at least 16GB of RAM):
````
python bootstrap_bow.py \
    --input-dir data/encoded/ngrams \
    --output-dir data/results/ngrams \
    --output-dir-bootstrap data/results/ngrams/bootstrap \
    --path-labels data/labels/dev_data.csv
````
-3.1.4. Display results
+4. Display results:
````
python display_results.py \
    --input-dir data/results/fields \
    --output-dir data/final/fields
````
````
python display_results.py \
    --input-dir data/results/ngrams \
    --output-dir data/final/ngrams
````
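For reference, the flags used in the commands above correspond to a plain argparse interface in `scripts/bootstrap_bow.py`. The sketch below is reconstructed from the flag names in this README; the description, help texts, and `required` settings are assumptions:
````
import argparse

parser = argparse.ArgumentParser(description="Bootstrap evaluation of bag-of-words features")
parser.add_argument("--input-dir", required=True,
                    help="directory with encoded features, e.g. data/encoded/fields")
parser.add_argument("--output-dir", required=True,
                    help="directory for aggregated results")
parser.add_argument("--output-dir-bootstrap", required=True,
                    help="directory for per-iteration bootstrap outputs")
parser.add_argument("--path-labels", required=True,
                    help="CSV with document labels, e.g. data/labels/dev_data.csv")
args = parser.parse_args()
````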
### 3.2. Distributed representations
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file removed results/__init__.py
2 changes: 1 addition & 1 deletion scripts/bootstrap_bow.py
@@ -8,7 +8,7 @@
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm import tqdm
-from encoders.bootstrap import Tokenizer, TextSelector
+from pk_classifier.bootstrap import Tokenizer, TextSelector
import argparse


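The import change above moves the shared pipeline components into the `pk_classifier` package. For context, a `TextSelector` in sklearn text pipelines is commonly a small transformer that picks one text column out of a DataFrame; this is a generic sketch of that pattern, not the repository's implementation:
````
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame inside a Pipeline."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.field]
````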
4 changes: 2 additions & 2 deletions scripts/bootstrap_dist.py
@@ -8,10 +8,10 @@
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm import tqdm
-from encoders.bootstrap import Tokenizer, TextSelector
+from pk_classifier.bootstrap import Tokenizer, TextSelector
import argparse
import warnings
-from encoders.utils import read_jsonl
+from pk_classifier.utils import read_jsonl


def f1_eval(y_pred, dtrain):
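The body of `f1_eval` is collapsed in this view. A custom evaluation function for `xgboost.train` conventionally receives the raw predictions and a `DMatrix` and returns a `(name, value)` pair; the sketch below shows that typical shape and is not necessarily the repository's exact code:
````
import numpy as np
from sklearn.metrics import f1_score

def f1_eval(y_pred, dtrain):
    # xgboost minimizes custom metrics, so report 1 - F1 as an error
    y_true = dtrain.get_label()
    return 'f1_err', 1.0 - f1_score(y_true, np.round(y_pred))
````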
2 changes: 1 addition & 1 deletion scripts/display_results.py
@@ -1,4 +1,4 @@
-from results.stats import get_all_results
+from pk_classifier.stats import get_all_results
import os
import argparse

3 changes: 1 addition & 2 deletions scripts/features_dist.py
@@ -1,9 +1,8 @@
from features_bow import run as run_bow
import os
import pandas as pd
from tqdm import tqdm
from sklearn.pipeline import Pipeline, FeatureUnion
-from encoders.utils import make_pipeline, Embedder, ConcatenizerEmb
+from pk_classifier.utils import make_pipeline, Embedder, ConcatenizerEmb


def pre_process(inp_path, out_path, field_list, ngrams, maxmin):
1 change: 0 additions & 1 deletion sparksetup/sparkconf.py
@@ -6,7 +6,6 @@
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
findspark.init(spark_home="/opt/spark-2.4.4/")
-

conf = SparkConf(). \
setAppName('main'). \
setMaster('local[*]'). \
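The `conf` chain is truncated in this view; a typical completion builds a SparkContext from it, as in the sketch below. The extra option shown is an illustrative assumption, not the file's actual settings:
````
from pyspark import SparkConf, SparkContext

conf = SparkConf(). \
    setAppName('main'). \
    setMaster('local[*]'). \
    set('spark.driver.memory', '16g')  # assumed value; tune to your machine

sc = SparkContext(conf=conf)
````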
