Answering your questions from biomedical literature

jinzanxia/biomedical-QA

Health Assistant: Answering Your Questions Anytime from Biomedical Literature

This is a Java implementation of the Health Assistant system described in the Bioinformatics paper "Health Assistant: Answering Your Questions Anytime from Biomedical Literature".

Bioasq Dataset

  • The BioASQ dataset is available from http://participants-area.bioasq.org/
  • The document collection consists of more than 26 million citations in XML format for biomedical literature from MEDLINE, life science journals, and online books. It is produced by the National Library of Medicine (NLM).
  • XML tags example:
    • <PMID>: a unique ID for each citation
    • <DateCompleted>: the date on which NLM completed processing of the citation record
    • <ArticleTitle>: the title of the article
    • <Abstract>: the abstract of the article
    • <MedlineJournalInfo>: information about the journal
    • <ChemicalList>: the list of chemical substances indexed for the article
    • <MeshHeadingList>: the list of MeSH headings assigned to the article
    • <KeywordList>: the list of keywords for the article
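
As a rough illustration of how these fields can be read, the sketch below parses a single citation with the JDK's built-in DOM parser. The sample record, the class name CitationFieldsDemo, and the helper names are made up for this example and are not taken from the repository; real MEDLINE files contain many citations per file and more nested elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class CitationFieldsDemo {

    // Made-up, heavily simplified citation record (assumption, not real data).
    static final String SAMPLE =
        "<MedlineCitation>"
      + "<PMID>12345</PMID>"
      + "<Article>"
      + "<ArticleTitle>An example title</ArticleTitle>"
      + "<Abstract><AbstractText>An example abstract.</AbstractText></Abstract>"
      + "</Article>"
      + "</MedlineCitation>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag, or "" if absent.
    static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("PMID:     " + firstText(doc, "PMID"));
        System.out.println("Title:    " + firstText(doc, "ArticleTitle"));
        System.out.println("Abstract: " + firstText(doc, "Abstract"));
    }
}
```

For the full collection, a streaming parser (SAX or StAX) would probably be a better fit than DOM, since the files are large.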

Search Engine

The open-source search engine used here is Galago, available from https://sourceforge.net/p/lemur/galago/ci/default/tree/

  • Build index
    • galago build [flags] --indexPath=<index> (--inputPath+<input>)+ --tokenizer/fields+{field-name}
    • --inputPath A file or a directory; may be specified as many times as needed. Galago can read html, xml, txt, arc (Heritrix), warc, trectext, trecweb and corpus files. Files may be compressed (.gz|.bz).
    • --indexPath The directory path of the index to produce.
  • Search
    • galago batch-search --index=<path_to_index> --requested=N <path_to_query_file>
    • --index=<path_to_index> Name and path to index.
    • --requested=N Number of results to return for each query. [default=1000]
    • <path_to_query_file> Input JSON query file.
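
For orientation, a Galago JSON query file is commonly a single object with a "queries" array whose entries carry a "number" and a "text" field. The sketch below writes such a structure by hand; the class name GalagoQueryFileDemo, the query IDs, and the query text are illustrative assumptions and do not come from this repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GalagoQueryFileDemo {

    // Escapes backslashes and double quotes so the value is a valid JSON string.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the JSON text for a map of query number -> Galago query string.
    static String toQueryJson(Map<String, String> queries) {
        StringBuilder sb = new StringBuilder("{\n  \"queries\": [\n");
        int written = 0;
        for (Map.Entry<String, String> e : queries.entrySet()) {
            sb.append("    {\"number\": \"").append(escapeJson(e.getKey()))
              .append("\", \"text\": \"").append(escapeJson(e.getValue()))
              .append("\"}");
            written++;
            sb.append(written < queries.size() ? ",\n" : "\n");
        }
        return sb.append("  ]\n}\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical query IDs and text, for illustration only.
        Map<String, String> queries = new LinkedHashMap<>();
        queries.put("q1", "#combine(role brca1 breast cancer)");
        queries.put("q2", "#combine(treatment migraine)");
        System.out.print(toQueryJson(queries));
    }
}
```

The printed text can be saved to a file and passed as <path_to_query_file> in the batch-search command above.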

Process Queries

The questions are from BioASQ Task 5b (http://participants-area.bioasq.org/).

  • Files
    • phaseA_5b_01.json: an example question list file
  • Pre-processing
    • Use phaseA_5b_01.json
    • Change the file paths in the source to point to your input files
    • Run script:
      • Format:
       java BioasqQuestion.java;
      
    • Output questions/ (The folder with all the questions)
  • Generate Document Queries
    • Use questions/ (The folder containing all the questions generated by the previous step)
    • Use basefsdmw2.txt (The file with basic configuration of FSDM method)
    • Use basefsdmw3Mesh.txt (The file with basic configuration of FSDM method)
    • Use basepdfr.txt (The file with basic configuration of PDFR method)
    • Use Inquery stopwords.txt (The file with stopwords)
    • Use types.txt (The file used to find similar words for w2v method)
    • Use bioasq5b_batch1_similar.txt (The file used to find similar words for w2v method )
    • Use stanford-postagger-full-2015-04-20/ (The tool for extracting nouns)
    • Change the file paths in the source to point to your input files
    • Run script:
      • Format:
       java BioasqBaseline.java;	(Generate a query file that uses the QL search method)
       java BioasqMySDM.java;		(Generate a query file that uses the SDM search method)
       java BioasqMyFSDM.java;	(Generate a query file that uses the FSDM search method)
       java BioasqNN.java;		(Generate a query file that uses the QL+NN search method)
       java BioasqMySdmFsdm.java;	(Generate a query file that uses the SDM+FSDM search method)
       java BioasqMySdmPDFR.java;	(Generate a query file that uses the SDM+PDFR search method)
       java BioasqSDMw2v.java;	(Generate a query file that uses the SDM+W2V search method)
       java BioasqNNsdmfsdm.java;	(Generate a query file that uses the SDM+NN+FSDM search method)
       java BioasqMySdmFsdmExpansion.java;		(Generate a query file that uses the SDM+FSDM+PRF(title) search method)
       java BioasqMySdmFsdmMeshExpansion.java;	(Generate a query file that uses the SDM+FSDM+PRF(mesh) search method)
       java BioasqNNsdmfsdmExpansion.java;		(Generate a query file that uses the SDM+NN+FSDM+PRF(title) search method)
       java BioasqNNsdmfsdmMeshExpansion.java;	(Generate a query file that uses the SDM+NN+FSDM+PRF(mesh) search method)
       java BioasqExpansion.java;			(Generate extended query vocabulary based on search results)
      
  • Generate Snippet Queries
    • Use questions/ (The folder containing all the questions generated by the Pre-processing step)
    • Use basepdfr.txt (The file with basic configuration of PDFR method)
    • Use Inquery stopwords.txt (The file with stopwords)
    • Use queryWordFreq.txt (The file with term frequency in queries)
    • Use stanford-postagger-full-2015-04-20/ (The tool for extracting nouns)
    • Change the file paths in the source to point to your input files
    • Run script:
      • Format:
       java BioasqSnippet.java;  			(Generate the snippet candidate set based on the result of document retrieval)
       java BioasqSnippetRetrievalModel.java;  	(Generate query files for various methods such as sdm, PDFR, tfidf, w2v, etc.)
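
The query-generation steps above turn a natural-language question into a structured Galago query. The following is a minimal sketch of that idea only, assuming a simple pipeline of lowercasing, stopword removal, and wrapping the remaining terms in a Galago operator such as #combine (query likelihood) or #sdm (sequential dependence). The tiny stopword set, the sample question, and the class name QuestionToQueryDemo are assumptions for illustration; the repository's own classes use the Inquery stopword file and additional resources, and may build their queries differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QuestionToQueryDemo {

    // Tiny stand-in stopword list (assumption); the README refers to an Inquery stopword file.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "are", "for", "in", "is", "of", "the", "what", "which"));

    // Lowercases the question, splits on non-alphanumeric characters, and drops stopwords.
    static List<String> contentTerms(String question) {
        List<String> terms = new ArrayList<>();
        for (String token : question.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Wraps the terms in a Galago operator, e.g. "combine" or "sdm".
    static String wrap(String operator, List<String> terms) {
        return "#" + operator + "(" + String.join(" ", terms) + ")";
    }

    public static void main(String[] args) {
        // Made-up question text, for illustration only.
        List<String> terms = contentTerms("What is the role of BRCA1 in breast cancer?");
        System.out.println(wrap("combine", terms));
        System.out.println(wrap("sdm", terms));
    }
}
```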
      

Evaluation

  • Download the TREC evaluation tool from http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz
  • Run your IR models and generate a search result file in the standard TREC format
    • Format:
      #query_id  Q0 #document_id rank predicted_relevance_score system_name;
      
  • Run the script "CreateQrels.java" to generate the relevance judgments (qrels) file in the standard TREC format
    • Format:
      #query_id 0 #ground_truth_document_id relevance_score;
      
  • Compile the tool by running make
  • Run the evaluation:
    trec_eval -q -c -M10 -m map path_to_answer_file path_to_result_file
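
To make the two file formats and the reported measure more concrete, the sketch below formats one result line and one qrels line, and computes average precision over the top-ranked documents for a single query. It is a simplified illustration and not a replacement for trec_eval: the class name TrecEvalFormatDemo, the IDs, and the scores are made up, and the ranking is assumed to contain no duplicate documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TrecEvalFormatDemo {

    // One line of a TREC result file: query_id Q0 document_id rank score system_name
    static String runLine(String queryId, String docId, int rank, double score, String system) {
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s", queryId, docId, rank, score, system);
    }

    // One line of a qrels file: query_id 0 document_id relevance
    static String qrelsLine(String queryId, String docId, int relevance) {
        return queryId + " 0 " + docId + " " + relevance;
    }

    // Average precision for one query, looking only at the first `cutoff` ranked documents
    // and dividing by the total number of relevant documents.
    static double averagePrecision(List<String> ranked, Set<String> relevant, int cutoff) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double precisionSum = 0.0;
        int limit = Math.min(cutoff, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                precisionSum += (double) hits / (i + 1);
            }
        }
        return precisionSum / relevant.size();
    }

    public static void main(String[] args) {
        // Hypothetical IDs and score, for illustration only.
        System.out.println(runLine("q1", "d1", 1, 12.5, "sdm"));
        System.out.println(qrelsLine("q1", "d1", 1));

        List<String> ranked = Arrays.asList("d1", "d2", "d3");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println(averagePrecision(ranked, relevant, 10));
    }
}
```

In the command above, -M10 limits the evaluation to the top 10 documents per query, -m map selects mean average precision, -q prints per-query values, and -c averages over all queries in the qrels file rather than only those present in the results.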
    

Citation

If you use our work, please cite the following paper:

@article{JinHealth,
 title={Health Assistant: Answering Your Questions Anytime from Biomedical Literature},
 author={Jin, Zan Xia and Zhang, Bo Wen and Fang, Fan and Zhang, Le Le and Yin, Xu Cheng},
 journal={Bioinformatics},
 year={2019}
}
