seq2rel: A sequence-to-sequence approach for document-level relation extraction

The corresponding code for our paper: A sequence-to-sequence approach for document-level relation extraction. Check out our demo here!

Installation

This repository requires Python 3.7.1 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.

Installing the library and dependencies

If you don't plan on modifying the source code, install from git using pip:

pip install git+https://github.com/JohnGiorgi/seq2rel.git

Otherwise, clone the repository and install from source using Poetry:

# Install poetry for your system: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -

# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel
cd seq2rel

# Install the package with poetry
poetry install

Usage

Preparing a dataset

Datasets are tab-separated files, where each example is contained on its own line. The first column contains the text, and the last column contains the relation. Relations themselves must be serialized to strings.

Take the following example, which denotes an adverse drug event ("@ADE@") between the drug benzodiazepine ("@DRUG@") and the effect coma ("@EFFECT@"):

A review of the literature showed no previous description of this pattern in benzodiazepine coma.	@ADE@ benzodiazepine @DRUG@ coma @EFFECT@ @EOR@

For convenience, we provide a second package, seq2rel-ds, which makes it easy to generate data in this format for various popular corpora.
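If you would rather write a small file by hand, the format described above can be sketched in a few lines of Python. This is only an illustration that mirrors the single example line shown; `serialize_ade` and `to_tsv_line` are hypothetical helpers and are not part of seq2rel or seq2rel-ds.

```python
def serialize_ade(drug: str, effect: str) -> str:
    """Linearize one drug/effect pair, copying the token layout of the example above."""
    return f"@ADE@ {drug} @DRUG@ {effect} @EFFECT@ @EOR@"


def to_tsv_line(text: str, relation: str) -> str:
    """Join the text (first column) and the serialized relation (last column) with a tab."""
    for field in (text, relation):
        # A tab or newline inside a field would break the one-example-per-line layout.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{text}\t{relation}"


text = (
    "A review of the literature showed no previous description "
    "of this pattern in benzodiazepine coma."
)
line = to_tsv_line(text, serialize_ade("benzodiazepine", "coma"))
print(line)
```

Writing one such line per example to a `.tsv` file should give a dataset in the shape described above. For real corpora, seq2rel-ds is the safer route.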

Training

To train the model, use the allennlp train command with one of our configs (or write your own!).

For example, to train a model on the Adverse Drug Event (ADE) corpus, first preprocess this data with seq2rel-ds:

seq2rel-ds preprocess ade "path/to/preprocessed/ade"

Then, call allennlp train with the ADE config we have provided:

allennlp train "training_config/transformer_copynet_ade.jsonnet" \
    --serialization-dir "output" \
    --overrides "{'train_data_path': 'path/to/preprocessed/ade/train.tsv'}" \
    --include-package "seq2rel" 

The --overrides flag allows you to override any field in the config with a JSON-formatted string, but you can equivalently update the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir. This can be changed to any directory you like.
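If you launch training from a script, it can be less error-prone to build the --overrides string with `json.dumps` than to quote it by hand. The sketch below only assembles the same command shown above; `build_train_command` is a hypothetical helper, not something seq2rel or AllenNLP provides.

```python
import json
import shlex
from typing import Dict, List


def build_train_command(
    config_path: str, serialization_dir: str, overrides: Dict[str, str]
) -> List[str]:
    """Assemble the allennlp train call from above as an argument list."""
    return [
        "allennlp", "train", config_path,
        "--serialization-dir", serialization_dir,
        # json.dumps produces the JSON-formatted string that --overrides expects.
        "--overrides", json.dumps(overrides),
        "--include-package", "seq2rel",
    ]


command = build_train_command(
    "training_config/transformer_copynet_ade.jsonnet",
    "output",
    {"train_data_path": "path/to/preprocessed/ade/train.tsv"},
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start training, assuming allennlp is installed in the active environment.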

Inference

To use the model as a library, import Seq2Rel and pass it some text (it accepts both strings and lists of strings):

from seq2rel import Seq2Rel

# Pretrained models are hosted on GitHub and are downloaded and cached automatically. This model is ~500 MB.
pretrained_model = "ade"

# Models are loaded via a dead-simple interface
seq2rel = Seq2Rel(pretrained_model)

# Extremely flexible inputs. User can provide...
# - a string
# - a list of strings
# - a text file (local path or URL)
input_text = "Ciprofloxacin-induced renal insufficiency in cystic fibrosis."

seq2rel(input_text)
# ['ciprofloxacin @DRUG@ renal insufficiency @EFFECT@ @ADE@']
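The model returns relations as linearized strings, so downstream code usually needs to turn them back into structured data. Below is a minimal sketch of such a parser. It assumes the output layout seen in the single example above (each entity followed by its @TYPE@ token, with the relation label last); `parse_relations` is a hypothetical helper and the library may offer its own utilities for this.

```python
import re
from typing import List, Tuple

# Special tokens look like @DRUG@, @EFFECT@ or @ADE@.
SPECIAL_TOKEN = re.compile(r"@([A-Z_]+)@")

Entity = Tuple[str, str]  # (mention text, entity type)
Relation = Tuple[str, List[Entity]]  # (relation label, entities)


def parse_relations(output: str) -> List[Relation]:
    """Parse strings such as 'drug @DRUG@ effect @EFFECT@ @ADE@'."""
    # Splitting on a capturing group yields alternating text and label pieces.
    pieces = SPECIAL_TOKEN.split(output.strip())
    relations: List[Relation] = []
    entities: List[Entity] = []
    for i in range(0, len(pieces) - 1, 2):
        mention, label = pieces[i].strip(), pieces[i + 1]
        if mention:
            # Text followed by a special token: an entity and its type.
            entities.append((mention, label))
        else:
            # A special token with no preceding text closes the relation.
            relations.append((label, entities))
            entities = []
    return relations


parsed = parse_relations("ciprofloxacin @DRUG@ renal insufficiency @EFFECT@ @ADE@")
print(parsed)
```

Note that this targets the inference output shown above, not the training line in "Preparing a dataset", which places the relation label first and ends with @EOR@.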

See the list of available PRETRAINED_MODELS in seq2rel/seq2rel.py, or print their names from the command line:

python -c "from seq2rel import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"