A dataset for mention-agnostic biomedical information extraction

norikinishida/hoip-dataset

HOIP dataset

Dataset for our BioNLP'24 paper titled "Mention-Agnostic Information Extraction for Ontological Annotation of Biomedical Articles".

Named entities are typically assumed to appear explicitly in text (such textual instances are called mentions), and entity features are derived from those mentions. Mentions are strong signals in information extraction tasks, since they directly show how entities are described in text. However, in real-world scenarios, important entities sometimes appear only implicitly.

To accelerate research on mention-agnostic information extraction, we introduce the HOIP dataset, a new biomedical dataset built on the Homeostasis Imbalance Process Ontology (HOIP), which focuses on understanding COVID-19 infectious mechanisms (courses).

  • The HOIP dataset consists of passages (plain text) extracted from PubMed and Wikipedia articles describing biomedical processes in the context of COVID-19 infectious courses. Each passage is a brief portion of an article that describes at least two specific processes.
  • The HOIP dataset annotates both entities and relation triples of the form (head entity, relation, tail entity).
  • The HOIP dataset requires the ability to infer, using background knowledge, entities and relations between them that are not explicitly described in the text.
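
The triple structure described above can be sketched in Python as follows. This is only an illustration: the `Triple` type, the `entities_of` helper, and the placeholder identifiers are hypothetical and are not taken from the dataset files. The sketch also assumes that entities are referred to by identifiers rather than by text spans, which is one way to read "mention-agnostic".

```python
from typing import Iterable, NamedTuple, Set


class Triple(NamedTuple):
    """A relation triple (head entity, relation, tail entity).

    Hypothetical representation: each field holds an identifier string,
    with no character offsets into the passage.
    """
    head: str
    relation: str
    tail: str


def entities_of(triples: Iterable[Triple]) -> Set[str]:
    """Collect every entity that occurs as the head or tail of a triple."""
    return {entity for t in triples for entity in (t.head, t.tail)}


# Placeholder identifiers for illustration only (not real dataset values).
example_triples = [
    Triple("PROCESS_A", "RELATION_X", "PROCESS_B"),
    Triple("PROCESS_B", "RELATION_Y", "PROCESS_C"),
]
example_entities = entities_of(example_triples)
```

Because no field points at a span in the passage, an entity can take part in a triple even when the passage never names it explicitly.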

The following figure shows an example from the HOIP dataset along with the approach proposed in our paper.

[Figure: example]

For the details of the dataset, please see our paper.

The HOIP ontology is also available from the NCBO BioPortal ontology repository (https://bioportal.bioontology.org/ontologies/HOIP) and on GitHub (https://github.com/yuki-yamagata/hoip).

Directory structure

.
|-- README.md
|-- LICENSE
|-- releases/ # dataset
|   |-- v1/
|       |-- train.json
|       |-- dev.json
|       |-- test.json
|       |-- hoip_ontology.json
|-- construction/ # source code to generate the dataset
|-- docs/ # our paper and some figures
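
Given the layout above, a split can be loaded with Python's standard `json` module. This is a minimal sketch, not part of the repository: the helper names `load_split` and `summarize` are hypothetical, and because this README does not describe the JSON schema, the code only inspects the top-level container instead of assuming any field names.

```python
import json
from pathlib import Path


def load_split(release_dir, split):
    """Load one split file (e.g. train, dev, or test) from a release directory.

    Assumes the file is named "<split>.json", as in the directory listing.
    """
    path = Path(release_dir) / f"{split}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize(data):
    """Return (type name, size) of the top-level JSON value.

    No schema is assumed: size is None for values that are not a list or dict.
    """
    kind = type(data).__name__
    size = len(data) if isinstance(data, (list, dict)) else None
    return kind, size
```

For example, calling `summarize(load_split("releases/v1", "train"))` from the repository root reports whether the training split is a list or a dictionary and how many top-level items it holds, which is a reasonable first step before writing schema-specific code.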

Citation

If you use the dataset, please cite this paper:

@inproceedings{khettari-etal-2024-mention,
    title={Mention-Agnostic Information Extraction for Ontological Annotation of Biomedical Articles},
    author={
        Khettari, Oumaima El and
        Nishida, Noriki and
        Liu, Shanshan and
        Munne, Rumana Ferdous and
        Yamagata, Yuki and
        Quiniou, Solen and
        Chaffron, Samuel and
        Matsumoto, Yuji
    },
    booktitle={The 23rd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks},
    month={August},
    year={2024},
    publisher={Association for Computational Linguistics},
    url={},
    doi={}
}
