This is the official codebase accompanying our ACL 2022 paper "RELiC: Retrieving Evidence for Literary Claims" (https://relic.cs.umass.edu). You can find our paper on arXiv here.
The code uses PyTorch 1.10+ and HuggingFace's transformers library for training the RoBERTa models. To install PyTorch, look for the Python package compatible with your local CUDA setup here.
virtualenv relic-venv
source relic-venv/bin/activate
pip install torch torchvision # currently, this is the version compatible with CUDA 10.1
pip install transformers
pip install tensorboardX
pip install --editable .
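Optionally, before going further, you can confirm that the installed PyTorch build actually sees your GPU. This is just a convenience check, not part of the repository:
import torch

# Print the installed PyTorch version and whether a CUDA device is visible.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())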
Download the dataset from this link. Your RELiC folder should look like,
(relic-venv) kalpesh@node187:relic-retrieval$ ls RELiC/
test.json train.json val.json
(relic-venv) kalpesh@node187:relic-retrieval$
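Each split is a single JSON file. A quick way to confirm the files downloaded and parse correctly (this snippet makes no assumptions about the internal schema and is not part of the repository):
import json

# Load each split and report its top-level type and size.
for split in ["train", "val", "test"]:
    with open(f"RELiC/{split}.json") as f:
        data = json.load(f)
    print(split, type(data).__name__, len(data))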
All pretrained models can be found in the dataset Google Drive folder. Individual checkpoint links are added below,
Model | Google Drive link |
---|---|
dense-RELiC (4 left, 4 right sentences) | link |
dense-RELiC (4 left, 0 right sentences) | link |
dense-RELiC (0 left, 4 right sentences) | link |
dense-RELiC (1 left, 1 right sentences) | link |
dense-RELiC (1 left, 0 right sentences) | link |
Make sure you have downloaded the dataset as described above. The evaluation script assumes the pretrained models have been downloaded from the Google Drive links above and placed in the retriever_train/saved_models folder. It's best to run this on a GPU, since dense vectors need to be computed before retrieval takes place.
# you may need to run "export CUDA_VISIBLE_DEVICES=0" to use GPU-0
# remove --cache if you don't wish to write a large output file with retrieval ranks
python scripts/relic_evaluation.py \
--model retriever_train/saved_models/model_denserelic_4_4 \
--cache
# output
Results with all quotes (7833 instances):
mean_rank = 704.6351, recall@1 = 0.0672, recall@3 = 0.1407, recall@5 = 0.1840, recall@10 = 0.2578, recall@50 = 0.4501, recall@100 = 0.5361, num_candidates = 10199.8426
The above script may take a while to finish (20-30 minutes on validation data). To run it on a single book only, run:
python scripts/relic_evaluation.py \
--model retriever_train/saved_models/model_denserelic_4_4 \
--eval_small
# output
Results with all quotes (1648 instances):
mean_rank = 796.5661, recall@1 = 0.0583, recall@3 = 0.1147, recall@5 = 0.1481, recall@10 = 0.2093, recall@50 = 0.3914, recall@100 = 0.4745, num_candidates = 9775.8471
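For reference, mean_rank is the average rank of the gold quote among the candidates, and recall@k is the fraction of quotes whose gold candidate is ranked in the top k. The toy sketch below shows how such numbers are computed from similarity scores; the embeddings are random and the shapes are made up, so this is not the repository's evaluation code:
import numpy as np

rng = np.random.default_rng(0)
contexts = rng.normal(size=(5, 768))    # one embedding per claim context
quotes = rng.normal(size=(1000, 768))   # one embedding per candidate quote
gold = rng.integers(0, 1000, size=5)    # index of the true quote per context

scores = contexts @ quotes.T            # similarity of every context/quote pair
order = np.argsort(-scores, axis=1)     # candidate indices, best first
ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                  for i in range(len(gold))])

print("mean_rank =", ranks.mean())
for k in (1, 3, 5, 10, 50, 100):
    print(f"recall@{k} =", (ranks <= k).mean())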
Make sure you have downloaded the dataset as described above. Run the following preprocessing script (adjust the --left_sents / --right_sents flags for shorter contexts):
python scripts/preprocess_lit_analysis_data.py --left_sents 4 --right_sents 4
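Conceptually, these flags control how many sentences of literary analysis on each side of the quotation slot are kept as the retrieval query. The rough sketch below illustrates that windowing; the helper and its inputs are hypothetical, and the actual logic lives in scripts/preprocess_lit_analysis_data.py:
def context_window(sentences, quote_idx, left_sents=4, right_sents=4):
    """Return up to `left_sents` sentences before and `right_sents` sentences
    after the quotation position. Hypothetical helper, not the repo's code."""
    left = sentences[max(0, quote_idx - left_sents):quote_idx]
    right = sentences[quote_idx + 1:quote_idx + 1 + right_sents]
    return " ".join(left), " ".join(right)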
Two scripts are used while training dense-RELiC: a model training script and an early-stopping evaluation script. Both scripts can be run simultaneously; the evaluation script periodically looks at the checkpoint folder and deletes suboptimal checkpoints. Alternatively, the evaluation script can be run after model training is finished (to find the best checkpoints).
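The early-stopping behavior described above follows a simple watch-and-prune pattern. The sketch below is only illustrative (the evaluate callback and the checkpoint path pattern are hypothetical); it is not retriever_train/examples/evaluate.sh itself:
import glob
import shutil
import time

def early_stopping_watch(ckpt_glob, evaluate, poll_seconds=600):
    """Periodically score every checkpoint matching `ckpt_glob` with the
    user-supplied `evaluate` callback and delete all but the best one."""
    best_path, best_score = None, float("-inf")
    while True:
        for ckpt in sorted(glob.glob(ckpt_glob)):
            if ckpt == best_path:
                continue                      # already scored and kept
            score = evaluate(ckpt)            # e.g. validation recall@k
            if score > best_score:
                if best_path is not None:
                    shutil.rmtree(best_path)  # previous best is now suboptimal
                best_path, best_score = ckpt, score
            else:
                shutil.rmtree(ckpt)           # worse than the current best
        time.sleep(poll_seconds)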
There are two ways to run training (directly or using SLURM):
- Run the example bash scripts directly,
# in terminal 1
# you may need to run "export CUDA_VISIBLE_DEVICES=0" to use GPU-0
bash retriever_train/examples/schedule.sh
# in terminal 2
# you may need to run "export CUDA_VISIBLE_DEVICES=1" to use GPU-1
# this script is used for early-stopping checkpoint selection; it is not a precise evaluation
bash retriever_train/examples/evaluate.sh
- If you have a SLURM setup, you can configure model hyperparameters using retriever_train/hyperparameter_config.py (which supports grid search too) and then run,
python retriever_train/schedule.py
This script launches both the training and evaluation processes simultaneously on SLURM, giving them a unique job ID (let's say X). You can access the logs using,
### Access training logs
cat retriever_train/logs/log_X.txt
### Access early stopping evaluation logs
cat retriever_train/logs/log_eval_X.txt
### Access hyperparameter config for experiment X
cat retriever_train/logs/expts.txt | grep "model_X"
### Access the bash scripts running on SLURM
cat retriever_train/slurm-schedulers/schedule_X.sh
cat retriever_train/slurm-schedulers/evaluate_X.sh
This script exports checkpoints to retriever_train/saved_models/model_X. There is also TensorBoard support; see retriever_train/runs.
NOTE: You may need to make minor changes to retriever_train/run_finetune_gpt2_template.sh, retriever_train/run_evaluate_gpt2_template.sh and retriever_train/schedule.py to make them compatible with your SLURM setup.
Additional libraries will be needed to run the baseline retrievers.
- SIM --- A semantic similarity model from Wieting et al. 2019 trained on STS data.
pip install nltk
pip install sentencepiece
# remove --cache if you don't wish to write a large output file with retrieval ranks
python scripts/relic_evaluation_sim.py --left_sents 1 --right_sents 1 --cache
- DPR --- A retriever from Karpukhin et al. 2020 trained on Natural Questions data.
# remove --cache if you don't wish to write a large output file with retrieval ranks
python scripts/relic_evaluation_dpr.py --left_sents 1 --right_sents 1 --cache
- c-REALM --- A retriever from Krishna et al. 2021 based on REALM and trained on ELI5 data.
### for c-REALM
# TF 2.3 is the version compatible with CUDA 10.1
# See https://www.tensorflow.org/install/source#gpu for TF-CUDA mapping
pip install tensorflow==2.3
pip install tensor2tensor
# Download and unzip the c-REALM checkpoint
wget https://storage.googleapis.com/rt-checkpoint/retriever.zip
unzip retriever.zip && rm retriever.zip
mv retriever crealm-retriever
rm -rf crealm-retriever/encoded_*
# remove --cache if you don't wish to write a large output file with retrieval ranks
python scripts/relic_evaluation_crealm.py --left_sents 1 --right_sents 1 --cache
- Random retrieval
python scripts/relic_evaluation_random.py --num_samples 100 --split val
You may submit your predictions for the test set here: https://forms.gle/1B6JuQ3nbGXCR2kC8. The leaderboard is maintained on the RELiC project page (https://relic.cs.umass.edu).
Your submission should be a .json file containing a dictionary where the unique IDs of the test set quotes are the keys and the values are ranked lists. Each list should contain the indices of the top 100 candidates retrieved by your model, in rank order. For example, if your retriever's top-ranked candidate is 99 for test set quote ID "070789" and 1532 for quote ID "070790", your .json dict should look like:
{
"070789": [99, ...],
"070790": [1532, ...],
...
}
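Before uploading, it may help to sanity-check the file against the format above. A small validation sketch (the filename here is just an example):
import json

with open("test_submission.json") as f:
    submission = json.load(f)

assert isinstance(submission, dict), "submission must be a dict keyed by quote ID"
for quote_id, rank_list in submission.items():
    assert isinstance(quote_id, str), f"key {quote_id!r} should be a string ID"
    assert len(rank_list) == 100, f"{quote_id}: expected 100 ranked candidates"
    assert all(isinstance(idx, int) for idx in rank_list)
print(f"OK: {len(submission)} test quotes, 100 candidates each")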
To make this file with dense-RELiC (or any of our other baselines), run the corresponding evaluation script with the --split test flag:
python scripts/relic_evaluation.py \
--model retriever_train/saved_models/model_denserelic_4_4 \
--split test
This will output a file retriever_train/saved_models/model_denserelic_4_4/test_submission.json, which you should upload to the Google Form. We will score this JSON file with scripts/score_submission.py against a hidden key file and post the results on the leaderboard.
If you run into any issues, please contact [email protected] and [email protected].
If you found our paper or this repository useful, please cite:
@inproceedings{relic22,
  author = {Katherine Thai and Yapei Chang and Kalpesh Krishna and Mohit Iyyer},
  booktitle = {Association for Computational Linguistics},
  year = {2022},
  title = {RELiC: Retrieving Evidence for Literary Claims},
}