# dataset_retrieval_pipeline
## 1. Configure the running environment

Set up an Ubuntu 14.04 system with about 32 GB of memory and 500 GB of disk.

Set up your working path:

```bash
WORK_PATH="/your/path/here/"
cd $WORK_PATH
mkdir $WORK_PATH/code $WORK_PATH/data $WORK_PATH/results $WORK_PATH/tools $WORK_PATH/downloads
```

Install the Oracle Java JDK (https://www.digitalocean.com/community/tutorials/how-to-install-java-on-ubuntu-with-apt-get):

```bash
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```

If you have multiple Java JDKs installed on the VM, see https://askubuntu.com/questions/233190/what-exactly-does-update-alternatives-do for management tips.

Install Anaconda for Python 2.7 (https://docs.continuum.io/anaconda/install):

```bash
wget -P $WORK_PATH/tools https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
bash $WORK_PATH/tools/Anaconda2-4.3.1-Linux-x86_64.sh
```

Install the Python packages: the NLTK corpora (stopwords, punkt/english), Biopython, and elasticsearch.

NLTK packages:

```bash
python -m nltk.downloader all  ## option 1: install all corpora and models
python -m nltk.downloader      ## option 2: enter an interactive interface and select the required packages
```

Biopython:

```bash
conda install -c anaconda biopython=1.68
```

elasticsearch:

```bash
pip install elasticsearch
```

Install Elasticsearch 5.0.2 (ES) (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/install-elasticsearch.html):

```bash
cd $WORK_PATH/tools  ## install ES under $WORK_PATH/tools
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.0.2.tar.gz
sha1sum elasticsearch-5.0.2.tar.gz
tar -xzf elasticsearch-5.0.2.tar.gz
```

Configure ES. First, change the JVM heap size (https://www.elastic.co/guide/en/elasticsearch/reference/5.0/heap-size.html):

```bash
vim $WORK_PATH/tools/elasticsearch-5.0.2/config/jvm.options
```

Set both Xms and Xmx to a reasonable value, such as 10 GB: `-Xms10g`, `-Xmx10g`.

Then add the protected words to ES:

```bash
wget -P $WORK_PATH/downloads ftp://nlmpubs.nlm.nih.gov/online/mesh/2016/asciimesh/d2016.bin
mkdir $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
## call get_MeSH_vocab.py to generate "mesh_and_entry_vocab.txt"
python $WORK_PATH/code/get_MeSH_vocab.py $WORK_PATH/downloads/d2016.bin $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
## check that "mesh_and_entry_vocab.txt" is now in $WORK_PATH/tools/elasticsearch-5.0.2/config/analysis
```

Start ES as a daemon process (https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html); the `-p` option takes the name of a pid file:

```bash
$WORK_PATH/tools/elasticsearch-5.0.2/bin/elasticsearch -d -p pid
```

Get the ES status:

```bash
curl -XGET '127.0.0.1:9200/_stats?pretty'                     ## show general information
curl -XGET 'http://127.0.0.1:9200/_cat/indices?v'             ## show all indices
curl -XGET 'http://127.0.0.1:9200/_count?pretty'              ## count the documents in all indices
curl -XDELETE 'http://127.0.0.1:9200/index_name_here?pretty'  ## delete an index
```

Install MetaMap (optional): follow the instructions at https://metamap.nlm.nih.gov/Installation.shtml.
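The same status checks can also be run from Python with the elasticsearch client installed above. A minimal sketch, assuming a default local node and the 5.x elasticsearch-py client:

```python
# A hedged sketch of the curl status checks above, using the elasticsearch
# Python client installed in step 1 (elasticsearch-py 5.x assumed).
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "127.0.0.1", "port": 9200}])
print("Node reachable: %s" % es.ping())             # sanity check on port 9200
print("Total documents: %s" % es.count()["count"])  # equivalent to /_count?pretty
print(es.cat.indices(v=True))                       # equivalent to /_cat/indices?v
# es.indices.delete(index="index_name_here")        # equivalent to the XDELETE call
```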
## 2. Prepare data

### a. Metadata of datasets provided by the bioCADDIE Challenge

```bash
wget -P $WORK_PATH/data https://biocaddie.org/sites/default/files/update_json_folder.zip
unzip $WORK_PATH/data/update_json_folder.zip -d $WORK_PATH/data
mv $WORK_PATH/data/update_json_folder $WORK_PATH/data/datamed_json  ## rename the decompressed directory datamed_json
```

### b. Collect additional information for the datasets

Option 1: use the prepared documents.

```bash
mv $WORK_PATH/code/additional_fields.tar.gz $WORK_PATH/data
tar -zxf $WORK_PATH/data/additional_fields.tar.gz
```

Option 2: collect the additional information from scratch.

```bash
mkdir $WORK_PATH/data/additional_fields
## Run split_tasks.py to prepare data
cd $WORK_PATH/code/ext_fields
python split_tasks.py
## Run ret.sh to collect additional information
sh ret.sh
```

### c. Re-generate the keyword queries file kw_questions.txt for PSD-keywords (optional)

*kw_questions.txt is already included under $WORK_PATH/code.*

```bash
## Create a file for each question
## Run MetaMap on each file using the default settings and save the output files
cd $WORK_PATH/code/rerank/metamap
javac *.java
java KeyWordExtractor /path/to/your/metamap_output_file
## Merge the results to generate kw_questions.txt
```

### d. Re-generate the Google-returned documents for the Distribution Shift method (optional)

*google_questions.txt is already included under $WORK_PATH/code.*

```bash
## Make sure you have a graphical user interface
## Install the Selenium tool from http://www.seleniumhq.org/
## Update $WORK_PATH/code/rerank/dist_shift/Google.java according to the browser you use; it currently uses Chrome by default
javac *.java
java Google $WORK_PATH/code/all_questions.txt $WORK_PATH/code/google_questions.txt
```

## 3. Run experiments

The bash scripts for the experiments are under `$WORK_PATH/code/experiments`. Before running them, check `$CODE_DIR/rerank/PSD/Constants.java` and make sure all the paths are correct.
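It can also help to confirm that the metadata corpus from step 2a is in place before launching the experiment scripts. A minimal sketch, assuming the directory layout from step 2a; the JSON schema of the documents is not specified here, so this only counts and parses them:

```python
# A hedged pre-flight check: verify that the DataMed corpus prepared in
# step 2a exists and contains parseable JSON documents.
import json, os

WORK_PATH = "/your/path/here/"
json_dir = os.path.join(WORK_PATH, "data", "datamed_json")

json_files = []
for root, _dirs, names in os.walk(json_dir):
    json_files.extend(os.path.join(root, n) for n in names if n.endswith(".json"))
print("JSON documents found: %d" % len(json_files))

# Parse one document to confirm the files are valid JSON.
with open(json_files[0]) as handle:
    doc = json.load(handle)
print("Top-level keys of the first document: %s" % sorted(doc))
```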
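For reference, the protected-words step in part 1 relies on `get_MeSH_vocab.py` to turn the ASCII MeSH file d2016.bin into mesh_and_entry_vocab.txt. A minimal sketch of that kind of extraction, pulling the heading and synonym terms out of the ASCII record format; the repository's actual script may differ:

```python
# A hedged sketch of the MeSH vocabulary extraction done by get_MeSH_vocab.py
# in part 1; the real script in this repository may differ. In the ASCII MeSH
# format, each heading is an "MH = ..." line and each synonym an "ENTRY = ..."
# or "PRINT ENTRY = ..." line whose term is the text before the first "|".
import sys

def extract_terms(mesh_path):
    terms = set()
    with open(mesh_path) as handle:
        for line in handle:
            if line.startswith("MH = "):
                terms.add(line.split(" = ", 1)[1].strip())
            elif line.startswith(("ENTRY = ", "PRINT ENTRY = ")):
                value = line.split(" = ", 1)[1]
                terms.add(value.split("|", 1)[0].strip())
    return terms

if __name__ == "__main__":
    # Usage: python mesh_vocab_sketch.py d2016.bin /path/to/analysis/dir
    terms = extract_terms(sys.argv[1])
    out_path = sys.argv[2].rstrip("/") + "/mesh_and_entry_vocab.txt"
    with open(out_path, "w") as out:
        out.write("\n".join(sorted(terms)))
```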