The corresponding code for our paper: A sequence-to-sequence approach for document-level relation extraction. Check out our demo here!
This repository requires Python 3.7.1 or later.
Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.
If you don't plan on modifying the source code, install from git using pip:
pip install git+https://github.com/JohnGiorgi/seq2rel.git
Otherwise, clone the repository and install from source using Poetry:
# Install poetry for your system: https://python-poetry.org/docs/#installation
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel
cd seq2rel
# Install the package with poetry
poetry install
Datasets are tab-separated files, where each example is contained on its own line. The first column contains the text, and the last column contains the relation. Relations themselves must be serialized to strings.
Take the following example, which denotes an adverse drug event ("@ADE@") between the drug benzodiazepine ("@DRUG@") and the effect coma ("@EFFECT@"):
A review of the literature showed no previous description of this pattern in benzodiazepine coma. @ADE@ benzodiazepine @DRUG@ coma @EFFECT@ @EOR@
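As a minimal sketch of the layout described above (text in the first column, serialized relation in the last), the following helpers write and read one example line. The function names `to_tsv_line` and `from_tsv_line` are hypothetical and are not part of seq2rel or seq2rel-ds.

```python
def to_tsv_line(text: str, relation: str) -> str:
    """Join a text and its serialized relation into one tab-separated line."""
    for field in (text, relation):
        # Tabs and newlines would break the one-example-per-line layout.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{text}\t{relation}"


def from_tsv_line(line: str):
    """Return (text, relation) from the first and last columns of a line."""
    columns = line.rstrip("\n").split("\t")
    return columns[0], columns[-1]


text = (
    "A review of the literature showed no previous description "
    "of this pattern in benzodiazepine coma."
)
relation = "@ADE@ benzodiazepine @DRUG@ coma @EFFECT@ @EOR@"
line = to_tsv_line(text, relation)
```

Reading the line back with `from_tsv_line(line)` should return the original text and relation string unchanged.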
For convenience, we provide a second package, seq2rel-ds, which makes it easy to generate data in this format for various popular corpora.
To train the model, use the allennlp train command with one of our configs (or write your own!).
For example, to train a model on the Adverse Drug Event (ADE) corpus, first preprocess this data with seq2rel-ds:
seq2rel-ds preprocess ade "path/to/preprocessed/ade"
Then, call allennlp train with the ADE config we have provided:
allennlp train "training_config/transformer_copynet_ade.jsonnet" \
--serialization-dir "output" \
--overrides "{'train_data_path': 'path/to/preprocessed/ade/train.tsv'}" \
--include-package "seq2rel"
The --overrides flag allows you to override any field in the config with a JSON-formatted string, but you can equivalently update the config itself if you prefer. During training, models, vocabulary, configuration, and log files will be saved to the directory provided by --serialization-dir. This can be changed to any directory you like.
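If you launch training from a script, one way to avoid quoting mistakes in the override string is to build it with Python's json module. This is only a sketch: it reuses the config path, data path, and flags from the command above and prints a shell-safe command rather than running it.

```python
import json
import shlex

# Same override as in the command above; add more config fields as needed.
overrides = {"train_data_path": "path/to/preprocessed/ade/train.tsv"}
overrides_arg = json.dumps(overrides)

command = [
    "allennlp", "train", "training_config/transformer_copynet_ade.jsonnet",
    "--serialization-dir", "output",
    "--overrides", overrides_arg,
    "--include-package", "seq2rel",
]

# Quote each argument so the printed command can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The same `command` list could be passed to `subprocess.run` if you prefer to start training directly from Python.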
To use the model as a library, import Seq2Rel and pass it some text (it accepts both strings and lists of strings):
from seq2rel import Seq2Rel
# Pretrained models are stored on GitHub and are downloaded and cached automatically. This model is ~500 MB.
pretrained_model = "ade"
# Models are loaded via a dead-simple interface
seq2rel = Seq2Rel(pretrained_model)
# Extremely flexible inputs. User can provide...
# - a string
# - a list of strings
# - a text file (local path or URL)
input_text = "Ciprofloxacin-induced renal insufficiency in cystic fibrosis."
seq2rel(input_text)
>>> ['ciprofloxacin @DRUG@ renal insufficiency @EFFECT@ @ADE@']
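The model returns relations as serialized strings, so downstream code usually needs to turn them back into structured data. The sketch below is an assumption based only on the example output shown here: it treats any `@NAME@` token preceded by text as an entity label for that text, and any `@NAME@` token with no preceding text as a standalone label (such as the relation type). The function `parse_relation` is hypothetical and is not part of seq2rel.

```python
import re

# Matches special tokens such as @DRUG@, @EFFECT@, or @ADE@.
SPECIAL_TOKEN = re.compile(r"@([A-Z_]+)@")


def parse_relation(serialized: str):
    """Split a serialized relation into (mention, label) pairs and bare labels."""
    entities, labels = [], []
    position = 0
    for match in SPECIAL_TOKEN.finditer(serialized):
        # Text between the previous special token and this one, if any.
        mention = serialized[position:match.start()].strip()
        if mention:
            entities.append((mention, match.group(1)))
        else:
            labels.append(match.group(1))
        position = match.end()
    return entities, labels


entities, labels = parse_relation(
    "ciprofloxacin @DRUG@ renal insufficiency @EFFECT@ @ADE@"
)
```

For the example output above, `entities` holds the drug and effect mentions with their labels, and `labels` holds the bare `ADE` token.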
See the list of available PRETRAINED_MODELS in seq2rel/seq2rel.py:
python -c "from seq2rel import PRETRAINED_MODELS ; print(list(PRETRAINED_MODELS.keys()))"