This is the official repository accompanying the paper *Edit Aware Representation Learning via Levenshtein Prediction*. If you use this code, please consider citing our paper as follows:
```bibtex
@inproceedings{marrese-taylor-etal-2023-edit,
    title = "Edit Aware Representation Learning via {L}evenshtein Prediction",
    author = "Marrese-Taylor, Edison and
      Reid, Machel and
      Solano, Alfredo",
    booktitle = "The Fourth Workshop on Insights from Negative Results in NLP",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.insights-1.6",
    pages = "53--58",
    abstract = "We propose a novel approach that employs token-level Levenshtein operations to learn a continuous latent space of vector representations to capture the underlying semantic information with regard to the document editing process. Though our model outperforms strong baselines when fine-tuned on edit-centric tasks, it is unclear if these results are due to domain similarities between fine-tuning and pre-training data, suggesting that the benefits of our proposed approach over regular masked language-modelling pre-training are limited.",
}
```
```bash
cd code/fairseq
pip install --editable ./
pip install einops
pip install transformers
pip install python-Levenshtein
```
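The pre-training objective is built around token-level Levenshtein operations, so it is worth sanity-checking the `python-Levenshtein` install. A minimal sketch (character-level here for brevity; the example strings are ours):

```python
import Levenshtein

# editops() returns a minimal list of (operation, src_pos, dst_pos)
# tuples that transform the first argument into the second.
src = "the cat sat on the mat"
dst = "the black cat sat on a mat"
for op, i, j in Levenshtein.editops(src, dst):
    print(op, i, j)

# With unit costs, the edit distance equals the number of editops.
assert Levenshtein.distance(src, dst) == len(Levenshtein.editops(src, dst))
```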
Create a folder to host the data. On our machine we used `~/data/early`; please adjust our scripts if you decide to use a different path.
Prepare the data using:

```bash
python convert_peer_to_tsv.py --data ~/data/early/x.jsonl
bash detokenize_tsv_dataset.sh ~/data/early/x.tsv
python convert_wikiedits_to_tsv.py --data ~/data/early/wikiedits
```
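The conversion scripts above are the authoritative source of the TSV layout; assuming one tab-separated example per line, a quick way to eyeball the output is a sketch like this:

```python
import itertools
import os

# Print the first few rows of the converted TSV, split into columns.
path = os.path.expanduser("~/data/early/x.tsv")
with open(path, encoding="utf-8") as f:
    for line in itertools.islice(f, 3):
        print(line.rstrip("\n").split("\t"))
```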
Preprocess the data (compute vocabularies, split, and binarize) using:
```bash
cd ~/data/early
bash preprocess.sh x.tsv.detok 80000 20000
bash preprocess.sh insertions.tsv.detok 9616458 2747560
bash preprocess.sh deletions.tsv.detok 6546673 1870478
bash preprocess.sh wikiedits 0 0
```
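We do not reproduce `preprocess.sh` here; judging from the calls above, the two numeric arguments look like train and validation split sizes (our assumption; the script itself is authoritative). A minimal Python sketch of such a split step:

```python
import itertools
import sys

def split_file(path, n_train, n_valid):
    """Hypothetical re-implementation of the split: the first n_train
    lines become the train set, the next n_valid the validation set."""
    with open(path, encoding="utf-8") as f, \
            open(path + ".train", "w", encoding="utf-8") as train, \
            open(path + ".valid", "w", encoding="utf-8") as valid:
        train.writelines(itertools.islice(f, n_train))
        valid.writelines(itertools.islice(f, n_valid))

if __name__ == "__main__":
    split_file(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))
```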
Please refer to `train.sh` for a reference script, and check the files inside the `jobs` folder for details of the exact hyperparameter settings we used when pre-training models on our cluster.
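Once pre-training finishes, the resulting checkpoint can be loaded like any fairseq RoBERTa checkpoint; a minimal sketch, with hypothetical paths:

```python
from fairseq.models.roberta import RobertaModel

# Point these at your pre-training output and binarized data directories.
model = RobertaModel.from_pretrained(
    "/output/path",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="/path/to/data-bin",
)
model.eval()

tokens = model.encode("A small edit to a sentence.")
features = model.extract_features(tokens)  # (1, seq_len, hidden_dim)
print(features.shape)
```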
For MNLI (from the GLUE tasks):
- Download the GLUE data with the script from https://github.com/nyu-mll/GLUE-baselines#downloading-glue (the RoBERTa example script no longer works; see facebookresearch/fairseq#3840):
```bash
git clone https://github.com/nyu-mll/GLUE-baselines
python GLUE-baselines/download_glue_data.py --data_dir glue_data --tasks all
```
- Preprocess the data using the RoBERTa example script (depending on your fairseq version, the script may also expect the GLUE data folder as its first argument):

```bash
bash code/fairseq/examples/roberta/preprocess_GLUE_tasks.sh ALL
```
For PAWS:
- Download PAWS by running:

```bash
cd glue_data
mkdir paws
cd paws
wget https://storage.googleapis.com/paws/english/paws_wiki_labeled_final.tar.gz
tar -xvzf paws_wiki_labeled_final.tar.gz
```
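`convert_paws.sh` below handles the actual conversion; if you want to inspect the raw files first, the archive unpacks into a `final/` folder whose TSVs carry `id`, `sentence1`, `sentence2` and `label` columns (per the PAWS release), e.g.:

```python
import csv
import itertools

with open("glue_data/paws/final/train.tsv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in itertools.islice(reader, 2):
        print(row["label"], row["sentence1"][:60])
```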
- Process the data using:

```bash
bash convert_paws.sh /path/to/glue_data
```
Use our command, which is based on the RoBERTa documentation, as follows:

```bash
fairseq-hydra-train --config-dir configs --config-name <config_name> \
    task.data=/path/to/data-bin \
    checkpoint.restore_file=/path/to/roberta/model.pt \
    checkpoint.save_dir=/output/path/ | tee -a /output/path/train.log
```
Here `<config_name>` is one of the file names inside the `configs` folder; for example, use `x_default` to fine-tune on the WikiEditsMix dataset.
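After fine-tuning, predictions can be obtained with the standard fairseq RoBERTa recipe; a sketch for an MNLI-style checkpoint (paths are hypothetical, and the head name assumes fairseq's default `sentence_classification_head`):

```python
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "/output/path",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="/path/to/MNLI-bin",
)
roberta.eval()

tokens = roberta.encode(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
)
logits = roberta.predict("sentence_classification_head", tokens)
print(logits.argmax(dim=-1).item())  # predicted label index
```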
- Models:
  - `roberta.py`: `forward()` function, enable multiple heads at the same time
  - `roberta.py`: `get_classification_head()`, add the `reduce` parameter to allow for sequence-labeling tasks such as Levenshtein prediction (see the sketch below)
- Tasks:
  - `lenveshtein_prediction.py`: the new task, based on `sentence_prediction.py`
- Criterions:
  - `lenveshtein_prediction.py`: the loss for the new task, based on `sentence_prediction.py`
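To illustrate the `reduce` change, here is a sketch of a RoBERTa-style classification head with such a flag; the names and wiring are ours, not a copy of the repository code:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """RoBERTa-style head with an optional `reduce` flag.

    reduce=True  -> pool the first (<s>) token and classify the sequence.
    reduce=False -> keep every position and classify each token, as a
                    token-level Levenshtein prediction task requires.
    """

    def __init__(self, input_dim, inner_dim, num_classes, reduce=True, dropout=0.1):
        super().__init__()
        self.reduce = reduce
        self.dense = nn.Linear(input_dim, inner_dim)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(inner_dim, num_classes)

    def forward(self, features):  # features: (batch, seq_len, input_dim)
        x = features[:, 0, :] if self.reduce else features
        x = self.dropout(torch.tanh(self.dense(self.dropout(x))))
        return self.out_proj(x)

# Sequence labeling: one logit vector per token.
head = ClassificationHead(768, 768, num_classes=4, reduce=False)
print(head(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 4])
```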