This is a Java implementation of the Health Assistant system described in the Bioinformatics paper "Health Assistant: Answering Your Questions Anytime from Biomedical Literature".
- The BioASQ dataset is available from http://participants-area.bioasq.org/
- The document collection consists of more than 26 million citations in XML format for biomedical literature from MEDLINE, life science journals, and online books, produced by the National Library of Medicine (NLM).
- Examples of XML tags:
- <PMID>: a unique ID for each citation
- <DateCompleted>: the date when the article was completed
- <ArticleTitle>: the title of the article
- <Abstract>: the abstract of the article
- <MedlineJournalInfo>: the information about the journal
- <ChemicalList>: the list of chemicals included in the article
- <MeshHeadingList>: the list of MeSH headings included in the article
- <KeywordList>: the list of keywords for the article
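As a rough illustration of how the tags listed above can be read, the sketch below pulls a few fields out of one citation with the JDK's built-in DOM parser. The class name CitationFields and the sample record are hypothetical and heavily trimmed; real MEDLINE records contain many more elements, and this is not necessarily how the project itself reads the collection.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the project: reads single fields from one citation.
class CitationFields {
    /** Parses an XML string into a DOM document, wrapping any failure in an unchecked exception. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse citation XML", e);
        }
    }

    /** Returns the trimmed text of the first element with the given tag, or "" if it is absent. */
    static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up sample record used only for illustration.
        String xml = "<MedlineCitation><PMID>12345</PMID>"
                + "<Article><ArticleTitle>An example title</ArticleTitle>"
                + "<Abstract><AbstractText>An example abstract.</AbstractText></Abstract>"
                + "</Article></MedlineCitation>";
        Document doc = parse(xml);
        System.out.println("PMID:     " + firstText(doc, "PMID"));
        System.out.println("Title:    " + firstText(doc, "ArticleTitle"));
        System.out.println("Abstract: " + firstText(doc, "Abstract"));
    }
}
```

For the full collection, a streaming parser would likely be a better fit than DOM, since the files are large.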
The open-source search engine code is from https://sourceforge.net/p/lemur/galago/ci/default/tree/
- Build index
- galago build [flags] --indexPath=<index> (--inputPath+<input>)+ --tokenizer/fields+{field-name}
- --inputPath Can be either a file or directory, and as many can be specified as you like. Galago can read html, xml, txt, arc (Heritrix), warc, trectext, trecweb and corpus files. Files may be compressed (.gz|.bz).
- --indexPath The directory path of the index to produce.
- Search
- galago batch-search --index=<path_to_index> --requested=N <path_to_query_file>
- --index=<path_to_index> Name and path to index.
- --requested=N Number of results to return for each query. [default=1000]
- <path_to_query_file> Input JSON query file.
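The JSON query file passed to batch-search is, to our understanding, an object with a "queries" array whose entries carry a "number" and a "text" field. The sketch below builds such a file as a string. The class name QueryFileSketch and the example query are hypothetical, and the layout should be checked against the Galago version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the project: renders queries as a Galago-style JSON file.
class QueryFileSketch {
    /** Escapes backslashes and double quotes; control characters are not handled here. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders query-id to query-text pairs as {"queries": [{"number": ..., "text": ...}, ...]}. */
    static String toJson(Map<String, String> queries) {
        StringBuilder sb = new StringBuilder("{\n  \"queries\": [\n");
        int written = 0;
        for (Map.Entry<String, String> e : queries.entrySet()) {
            sb.append("    {\"number\": \"").append(escape(e.getKey()))
              .append("\", \"text\": \"").append(escape(e.getValue())).append("\"}");
            written++;
            sb.append(written < queries.size() ? ",\n" : "\n");
        }
        return sb.append("  ]\n}\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> queries = new LinkedHashMap<>();
        // Made-up query id and text, for illustration only.
        queries.put("q1", "#combine(breast cancer gene)");
        System.out.print(toJson(queries));
    }
}
```

The printed text can be saved to a file and passed as <path_to_query_file>.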
The questions are from BioASQ Task 5b (http://participants-area.bioasq.org/).
- Files
- phaseA_5b_01.json: an example of a questions list file
- Pre-processing
- Use phaseA_5b_01.json
- Change the file path to your input files
- Run script:
- Format: java BioasqQuestion.java;
- Output: questions/ (The folder with all the questions)
- Generate Document Queries
- Use questions/ (The folder containing all the questions generated by the previous step)
- Use basefsdmw2.txt (The file with basic configuration of FSDM method)
- Use basefsdmw3Mesh.txt (The file with basic configuration of FSDM method)
- Use basepdfr.txt (The file with basic configuration of PDFR method)
- Use Inquery stopwords.txt (The file with stopwords)
- Use types.txt (The file used to find similar words for w2v method)
- Use bioasq5b_batch1_similar.txt (The file used to find similar words for w2v method)
- Use stanford-postagger-full-2015-04-20/ (The tool for extracting nouns)
- Change the file path to your input files
- Run script:
- Format:
- java BioasqBaseline.java; (Generate a query file that uses the QL search method)
- java BioasqMySDM.java; (Generate a query file that uses the SDM search method)
- java BioasqMyFSDM.java; (Generate a query file that uses the FSDM search method)
- java BioasqNN.java; (Generate a query file that uses the QL+NN search method)
- java BioasqMySdmFsdm.java; (Generate a query file that uses the SDM+FSDM search method)
- java BioasqMySdmPDFR.java; (Generate a query file that uses the SDM+PDFR search method)
- java BioasqSDMw2v.java; (Generate a query file that uses the SDM+W2V search method)
- java BioasqNNsdmfsdm.java; (Generate a query file that uses the SDM+NN+FSDM search method)
- java BioasqMySdmFsdmExpansion.java; (Generate a query file that uses the SDM+FSDM+PRF(title) search method)
- java BioasqMySdmFsdmMeshExpansion.java; (Generate a query file that uses the SDM+FSDM+PRF(mesh) search method)
- java BioasqNNsdmfsdmExpansion.java; (Generate a query file that uses the SDM+NN+FSDM+PRF(title) search method)
- java BioasqNNsdmfsdmMeshExpansion.java; (Generate a query file that uses the SDM+NN+FSDM+PRF(mesh) search method)
- java BioasqExpansion.java; (Generate extended query vocabulary based on search results)
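To make the SDM step more concrete, here is a minimal sketch of how a question might be turned into a sequential dependence query string: stopwords are dropped, then unigrams, ordered adjacent pairs, and unordered windows are combined with fixed weights. The class name SdmQuerySketch, the tokenization rule, the weights 0.85/0.1/0.05, and the window size 8 are illustrative assumptions and are not taken from BioasqMySDM.java; the operator spelling (#combine, #od, #uw) should be verified against the Galago version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the project: builds an SDM-style query string.
class SdmQuerySketch {
    /** Lowercases the question, splits on non-alphanumerics, and drops stopwords. */
    static List<String> tokenize(String question, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String tok : question.toLowerCase().split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !stopwords.contains(tok)) {
                terms.add(tok);
            }
        }
        return terms;
    }

    /** Combines unigram, ordered-pair, and unordered-window evidence over adjacent terms. */
    static String sdm(List<String> terms) {
        String unigrams = "#combine(" + String.join(" ", terms) + ")";
        if (terms.size() < 2) {
            return unigrams; // no adjacent pairs, so fall back to the unigram query
        }
        List<String> ordered = new ArrayList<>();
        List<String> unordered = new ArrayList<>();
        for (int i = 0; i + 1 < terms.size(); i++) {
            String pair = terms.get(i) + " " + terms.get(i + 1);
            ordered.add("#od:1(" + pair + ")");
            unordered.add("#uw:8(" + pair + ")");
        }
        // Assumed weights for the three components; tune them on held-out questions.
        return "#combine:0=0.85:1=0.1:2=0.05( " + unigrams
                + " #combine(" + String.join(" ", ordered) + ")"
                + " #combine(" + String.join(" ", unordered) + ") )";
    }

    public static void main(String[] args) {
        // Tiny made-up stopword list; the project reads its list from a file.
        Set<String> stop = new HashSet<>(Arrays.asList("what", "is", "the", "of"));
        System.out.println(sdm(tokenize("What is the role of BRCA1?", stop)));
    }
}
```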
- Generate Snippet Queries
- Use questions/ (The folder containing all the questions generated by the Pre-processing step)
- Use basepdfr.txt (The file with basic configuration of PDFR method)
- Use Inquery stopwords.txt (The file with stopwords)
- Use queryWordFreq.txt (The file with term frequency in queries)
- Use stanford-postagger-full-2015-04-20/ (The tool for extracting nouns)
- Change the file path to your input files
- Run script:
- Format:
- java BioasqSnippet.java; (Generate the snippet candidate set based on the result of document retrieval)
- java BioasqSnippetRetrievalModel.java; (Generate query files for various methods such as sdm, PDFR, tfidf, w2v, etc.)
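One plausible way to build a snippet candidate set is to split the abstract of each retrieved document into sentences. The sketch below does this with the JDK's BreakIterator. Treating candidates as single sentences is an assumption made for illustration, and the class name SnippetCandidateSketch is hypothetical; BioasqSnippet.java may segment the text differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the project: sentence-level snippet candidates.
class SnippetCandidateSketch {
    /** Splits an abstract into trimmed, non-empty sentences. */
    static List<String> sentences(String abstractText) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(abstractText);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = abstractText.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up abstract text, for illustration only.
        String text = "Aspirin inhibits cyclooxygenase. It also reduces fever.";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

Biomedical abstracts often contain abbreviations and decimal numbers, so a domain-specific sentence splitter may give cleaner candidates than this generic one.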
- Download the TREC evaluation tool from http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz;
- Run your IR models and generate a standard TREC search result file
- Format: #query_id Q0 #document_id rank predicted_relevance_score system_name;
- Run the script "CreateQrels.java" to generate the standard answers file
- Format: #query_id 0 #ground_truth_document_id relevance_score;
- Compile the tool with the command make;
- Run the evaluation: trec_eval -q -c -M10 -m map path_to_answer_file path_to_result_file
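For a quick sanity check of the numbers trec_eval reports, the sketch below formats one result line in the run-file layout given above and computes average precision over the top-ranked documents, dividing by the total number of relevant documents for the query. The class name TrecEvalSketch and the sample ids are hypothetical. It is only an approximation of what trec_eval does with -M10 -m map and is not a replacement for the tool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the project: run-file line formatting and AP at a cutoff.
class TrecEvalSketch {
    /** Formats one line as: query_id Q0 document_id rank score system_name. */
    static String runLine(String queryId, String docId, int rank, double score, String system) {
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s", queryId, docId, rank, score, system);
    }

    /** Average precision over the first `cutoff` ranked documents for one query. */
    static double averagePrecision(List<String> ranked, Set<String> relevant, int cutoff) {
        if (relevant.isEmpty()) {
            return 0.0; // no relevant documents known for this query
        }
        int hits = 0;
        double precisionSum = 0.0;
        for (int i = 0; i < ranked.size() && i < cutoff; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                precisionSum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return precisionSum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and relevance judgments, for illustration only.
        List<String> ranked = Arrays.asList("d1", "d2", "d3");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println(runLine("q1", "d1", 1, 2.5, "sdm"));
        System.out.println(averagePrecision(ranked, relevant, 10));
    }
}
```

Averaging this value over all queries gives a MAP estimate that can be compared with the trec_eval output.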
If you use our work, please cite the following paper:
@article{JinHealth,
title={Health Assistant: Answering Your Questions Anytime from Biomedical Literature},
author={Jin, Zan Xia and Zhang, Bo Wen and Fang, Fan and Zhang, Le Le and Yin, Xu Cheng},
journal={Bioinformatics},
year={2019}
}