1. Configure running environment
Set up an Ubuntu 14.04 system with about 32 GB of memory and 500 GB of disk space.
Set up your working path
WORK_PATH="/your/path/here" ## no spaces around "=" in a shell assignment
cd $WORK_PATH
mkdir $WORK_PATH/code $WORK_PATH/data $WORK_PATH/results $WORK_PATH/tools $WORK_PATH/downloads
Install the Oracle Java JDK (https://www.digitalocean.com/community/tutorials/how-to-install-java-on-ubuntu-with-apt-get)
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
If you have multiple Java JDKs installed on the VM, see this link for tips on managing them: https://askubuntu.com/questions/233190/what-exactly-does-update-alternatives-do
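For example, you can check which JDK is active and switch between installed ones with update-alternatives (the registered alternatives depend on your system):
java -version                           ## show the currently active JDK
sudo update-alternatives --config java  ## interactively choose among installed JDKs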
Install Anaconda for Python 2.7 (https://docs.continuum.io/anaconda/install)
wget -P $WORK_PATH/tools https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
bash $WORK_PATH/tools/Anaconda2-4.3.1-Linux-x86_64.sh
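Open a new shell (so the PATH changes made by the installer take effect), then sanity-check the install:
conda --version   ## print the installed conda version
python --version  ## should report a Python 2.7.x build from Anaconda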
Install the Python packages: NLTK (the corpora/stopwords and punkt/english data), Biopython, and elasticsearch
NLTK packages:
Option 1: python -m nltk.downloader all ## install all corpora and models
Option 2: python -m nltk.downloader ## open an interactive interface and select the required packages
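A lighter-weight alternative, if you only want the data named above, is to download just those packages:
python -m nltk.downloader stopwords punkt ## fetch only the stopwords corpus and the punkt models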
Biopython:
conda install -c anaconda biopython=1.68
elasticsearch:
pip install elasticsearch
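A one-line sanity check that the Python dependencies import cleanly (just a quick verification, not part of the original pipeline):
python -c "import nltk, Bio, elasticsearch; print('all imports OK')"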
Install Elasticsearch 5.0.2 (ES) (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/install-elasticsearch.html)
cd $WORK_PATH/tools ## install ES under $WORK_PATH/tools
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.tar.gz
sha1sum elasticsearch-5.0.2.tar.gz
tar -xzf elasticsearch-5.0.2.tar.gz
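The sha1sum output above is meant to be compared against the published checksum. Elastic also serves a .sha1 file next to each artifact, so an automated check could look like the sketch below (assuming the .sha1 file contains just the bare hash):
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.tar.gz.sha1
echo "$(cat elasticsearch-5.0.2.tar.gz.sha1)  elasticsearch-5.0.2.tar.gz" | sha1sum -c - ## prints OK on a match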
Configure ES
Change the JVM heap size for ES (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/heap-size.html)
vim $WORK_PATH/tools/elasticsearch-5.0.2/config/jvm.options
Set both Xms and Xmx to a reasonable value, such as 10g: -Xms10g, -Xmx10g
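After editing, the relevant lines should look like the sketch below; the usual guidance is to keep the heap at or below half of the machine's RAM and to set Xms and Xmx to the same value:
## in $WORK_PATH/tools/elasticsearch-5.0.2/config/jvm.options
-Xms10g
-Xmx10g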
Add protected words to ES
wget -P $WORK_PATH/downloads ftp://nlmpubs.nlm.nih.gov/online/mesh/2016/asciimesh/d2016.bin
mkdir $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
## call get_MeSH_vocab.py to generate "mesh_and_entry_vocab.txt"
python $WORK_PATH/code/get_MeSH_vocab.py $WORK_PATH/downloads/d2016.bin $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
## check if "mesh_and_entry_vocab.txt" is in $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
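How the vocabulary file is consumed depends on the index settings defined by the code; a typical Elasticsearch pattern (sketched here as an assumption, with a hypothetical index name, not the repo's actual mapping) is a keyword_marker token filter that protects these terms from stemming:
## a sketch only; example_index and the analyzer names are hypothetical
curl -XPUT '127.0.0.1:9200/example_index?pretty' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "protect_mesh": {
          "type": "keyword_marker",
          "keywords_path": "analysis/mesh_and_entry_vocab.txt"
        }
      },
      "analyzer": {
        "protected_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "protect_mesh", "porter_stem"]
        }
      }
    }
  }
}'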
Start ES as a daemon process (https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)
$WORK_PATH/tools/elasticsearch-5.0.2/bin/elasticsearch -d -p pid ## -p needs a pid-file argument
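The -p flag makes ES write its process ID to the named file (pid here), so the daemon can later be stopped with:
kill $(cat pid) ## shut down the ES daemon started above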
Get ES status
curl -XGET '127.0.0.1:9200/_stats?pretty' ## show general information
curl -XGET 'http://127.0.0.1:9200/_cat/indices?v' ## show all indices
curl -XGET 'http://127.0.0.1:9200/_count?pretty' ## count the documents in all indices
## delete an index
curl -XDELETE 'http://127.0.0.1:9200/index_name_here?pretty'
Install MetaMap (optional)
## follow the installation instructions at https://metamap.nlm.nih.gov/Installation.shtml
2. Prepare data
a. Metadata of datasets provided by the bioCADDIE Challenge
wget -P $WORK_PATH/data https://biocaddie.org/sites/default/files/update_json_folder.zip
unzip $WORK_PATH/data/update_json_folder.zip -d $WORK_PATH/data
mv $WORK_PATH/data/update_json_folder $WORK_PATH/data/datamed_json ## rename the decompressed directory to datamed_json
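A quick check that the metadata landed where the later steps expect it (the exact file count depends on the challenge release):
find $WORK_PATH/data/datamed_json -type f | wc -l ## count the extracted metadata documents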
b. Collect additional information for the datasets
Option 1: use prepared documents
mv $WORK_PATH/code/additional_fields.tar.gz $WORK_PATH/data
tar -zxf $WORK_PATH/data/additional_fields.tar.gz -C $WORK_PATH/data ## extract into the data directory
Option 2: get additional information from scratch
mkdir $WORK_PATH/data/additional_fields
## Run split_tasks.py to prepare data
cd $WORK_PATH/code/ext_fields
python split_tasks.py
## Run ret.sh to collect additional information
sh ret.sh
c. Re-generate the keyword query file kw_questions.txt for PSD-keywords (optional)
*kw_questions.txt is already included under $WORK_PATH/code*
## Create one file per question
## Run MetaMap on each file with the default settings and save the output files
cd $WORK_PATH/code/rerank/metamap
javac *.java
java KeyWordExtractor /path/to/your/metamap_output_file
## Merge the results to generate kw_questions.txt
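Assuming one KeyWordExtractor output file per question, named so that shell globbing preserves question order (output_q*.txt is a hypothetical naming scheme), the merge can be as simple as:
cat output_q*.txt > $WORK_PATH/code/kw_questions.txt ## hypothetical file names; adjust to your outputs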
d. Re-generate the Google-returned documents for the Distribution Shift method (optional)
*google_questions.txt is already included under $WORK_PATH/code*
## Make sure the machine has a graphical user interface (Selenium drives a real browser)
## Install the Selenium tool from http://www.seleniumhq.org/
## Update $WORK_PATH/code/rerank/dist_shift/Google.java for the browser you use; it uses Chrome by default
cd $WORK_PATH/code/rerank/dist_shift
javac *.java
java Google $WORK_PATH/code/all_questions.txt $WORK_PATH/code/google_questions.txt
3. Run experiments
## Bash scripts for experiments are under $WORK_PATH/code/experiments
## Check $WORK_PATH/code/rerank/PSD/Constants.java and make sure all the paths are correct
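For example, list the available scripts before running one (script names vary; use the actual files shipped in the directory):
ls $WORK_PATH/code/experiments ## list the available experiment scripts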