A PyTorch package used to fine-tune pre-trained Transformers for sequence-to-sequence language generation
The recommended way to run the code is using docker:
docker run -it --rm --runtime=nvidia --ipc=host --privileged pytorch/pytorch:1.2-cuda10.0-cudnn7-devel bash
The following Python packages need to be installed:
pip install --user methodtools py-rouge pyrouge nltk
python -c "import nltk; nltk.download('punkt')"
git clone https://github.com/NVIDIA/apex.git && cd apex && git reset --hard de6378f5dae8fcf2879a4be8ecea8bbcb9e59d5 && python setup.py install --cuda_ext --cpp_ext
Install the repo as a package:
git clone this repo into ${code_dir}
cd ${code_dir} ; pip install --editable .
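After installation, a quick check that the dependencies can be located may save a failed run later. The following is a minimal sketch, not part of this repo. The module names in `ASSUMED_MODULES` are assumptions (a package's import name can differ from its pip name, e.g. `py-rouge` is assumed here to import as `rouge`), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the packages installed above; adjust if your
# environment uses different names.
ASSUMED_MODULES = ["torch", "methodtools", "rouge", "pyrouge", "nltk", "apex"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed modules were found.")
```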
We recommend using an uncased model:
- unilm1.2-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- unilm2-base-uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- unilm2-large-uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
If you would like to use a cased model:
- unilm1-base-cased: 12-layer, 768-hidden, 12-heads, 110M parameters
- unilm1-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- unilm2-large-cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
If you prefer small pretrained models for faster inference speed:
- minilm-l12-h384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters
We support two dataset formats:
- Text format: each line contains a json string of an example. "src" contains source sequence text, "tgt" contains target sequence text ("tgt" can be ignored for decoding). The data should be pre-processed as follows:
{"src": "Messages posted on social media claimed the user planned to `` kill as many people as possible ''", "tgt": "Threats to kill pupils in a shooting at a Blackpool school are being investigated by Lancashire police ."}
{"src": "Media playback is unsupported on your device", "tgt": "A slide running the entire length of one of the steepest city centre streets in Europe has been turned into a massive three-lane water adventure ."}
{"src": "Chris Erskine crossed low for Kris Doolan to tap home and give the Jags an early lead .", "tgt": "Partick Thistle will finish in the Scottish Premiership 's top six for the first time after beating Motherwell"}
- Tokenized format: if you use tokenized data (with the same WordPiece tokenizer as BERT), "src" is a list of source sequence tokens, and "tgt" is a list of target sequence tokens ("tgt" can be ignored for decoding):
{"src": ["messages", "posted", "on", "social", "media", "claimed", "the", "user", "planned", "to", "\"", "kill", "as", "many", "people", "as", "possible", "\""], "tgt": ["threats", "to", "kill", "pupils", "in", "a", "shooting", "at", "a", "blackpool", "school", "are", "being", "investigated", "by", "lancashire", "police", "."]}
{"src": ["media", "playback", "is", "un", "##su", "##pp", "##orted", "on", "your", "device"], "tgt": ["a", "slide", "running", "the", "entire", "length", "of", "one", "of", "the", "steep", "##est", "city", "centre", "streets", "in", "europe", "has", "been", "turned", "into", "a", "massive", "three", "-", "lane", "water", "adventure", "."]}
{"src": ["chris", "erskine", "crossed", "low", "for", "kris", "doo", "##lan", "to", "tap", "home", "and", "give", "the", "ja", "##gs", "an", "early", "lead", "."], "tgt": ["part", "##ick", "thistle", "will", "finish", "in", "the", "scottish", "premiership", "'", "s", "top", "six", "for", "the", "first", "time", "after", "beating", "mother", "##well"]}
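As a minimal sketch of producing the text format, the snippet below writes source/target pairs as one json object per line and reads them back. The helper names `write_text_format` and `read_examples` are hypothetical and not part of this repo.

```python
import json


def write_text_format(pairs, path):
    """Write (source, target) string pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            record = {"src": src}
            if tgt is not None:  # "tgt" is left out for decoding-only files
                record["tgt"] = tgt
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_examples(path):
    """Read a JSON-lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For a decoding-only input file, pass `None` as the target so that only "src" is written.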
The code automatically detects the input format. If a json line contains lists, the input is processed as the tokenized format; if it contains strings, the code tokenizes them.
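A check of this kind could look like the sketch below. It assumes that inspecting the type of "src" is enough to tell the two formats apart; `detect_format` is a hypothetical helper, and the repo's actual detection logic may differ.

```python
import json


def detect_format(json_line):
    """Classify one JSON line as 'tokenized' (list of tokens) or 'text' (string)."""
    example = json.loads(json_line)
    src = example["src"]
    if isinstance(src, list):
        return "tokenized"
    if isinstance(src, str):
        return "text"
    raise ValueError(f"unexpected type for 'src': {type(src).__name__}")
```

Running it over the first line of a data file before training is a cheap way to confirm which path the input will take.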
Example: XSum with unilm1.2-base-uncased
Pre-processed json dataset links: text format, or tokenized format.
# path of training data
TRAIN_FILE=/your/path/to/train.json
# folder used to save fine-tuned checkpoints
OUTPUT_DIR=/your/path/to/save_checkpoints
# folder used to cache package dependencies
CACHE_DIR=/your/path/to/transformer_package_cache
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} \
--model_type unilm --model_name_or_path unilm1.2-base-uncased \
--do_lower_case --fp16 --fp16_opt_level O2 --max_source_seq_length 464 --max_target_seq_length 48 \
--per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
--learning_rate 7e-5 --num_warmup_steps 500 --num_training_steps 32000 --cache_dir ${CACHE_DIR}
- The fine-tuning batch size = number of GPUs * per_gpu_train_batch_size * gradient_accumulation_steps. So in the above example, the batch size is 4 * 16 * 1 = 64. The three arguments need to be adjusted together to keep the total batch size unchanged.
- --do_lower_case: for uncased models
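The batch size relation above can be sketched in a few lines of Python. This is an illustration only; `effective_batch_size` and `accumulation_steps_for` are hypothetical helpers, not functions from this repo.

```python
def effective_batch_size(num_gpus, per_gpu_train_batch_size, gradient_accumulation_steps):
    """Total fine-tuning batch size as the product of the three settings."""
    return num_gpus * per_gpu_train_batch_size * gradient_accumulation_steps


def accumulation_steps_for(target_batch_size, num_gpus, per_gpu_train_batch_size):
    """Return the gradient_accumulation_steps that keeps the total batch size fixed."""
    per_step = num_gpus * per_gpu_train_batch_size
    if target_batch_size % per_step != 0:
        raise ValueError(
            f"{target_batch_size} is not a multiple of {per_step}; "
            "change per_gpu_train_batch_size or the number of GPUs"
        )
    return target_batch_size // per_step


if __name__ == "__main__":
    # The example above: 4 GPUs, 16 per GPU, no accumulation.
    print(effective_batch_size(4, 16, 1))
    # With only 2 GPUs, keep the total at 64 by accumulating gradients.
    print(accumulation_steps_for(64, 2, 16))
```

For instance, moving from 4 GPUs to 2 with the same per-GPU batch size of 16 requires 2 accumulation steps to keep the total at 64.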
# path of the fine-tuned checkpoint
MODEL_PATH=/your/path/to/model_checkpoint
SPLIT=validation
# input file that you would like to decode
INPUT_JSON=/your/path/to/${SPLIT}.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
--fp16 --model_type unilm --tokenizer_name unilm1.2-base-uncased --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
--model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
- The decoding results are saved at ${MODEL_PATH}.${SPLIT}.
- --do_lower_case: for uncased models
The golden answer text files can be downloaded here.
SPLIT=validation
GOLD_PATH=/your/path/to/${SPLIT}.target
# ${MODEL_PATH}.${SPLIT} is the predicted target file
python evaluations/eval_for_xsum.py --pred ${MODEL_PATH}.${SPLIT} --gold ${GOLD_PATH} --split ${SPLIT}
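Before running the evaluation script, it can help to inspect predictions next to their references. The sketch below assumes that both the prediction file and the gold file hold one sequence per line in the same order; that layout is an assumption, and `load_pairs` is a hypothetical helper rather than part of the evaluation code.

```python
def load_pairs(pred_path, gold_path):
    """Return (prediction, reference) pairs from two line-aligned text files."""
    with open(pred_path, encoding="utf-8") as fp:
        preds = [line.rstrip("\n") for line in fp]
    with open(gold_path, encoding="utf-8") as fg:
        golds = [line.rstrip("\n") for line in fg]
    if len(preds) != len(golds):
        raise ValueError(
            f"line count mismatch: {len(preds)} predictions vs {len(golds)} references"
        )
    return list(zip(preds, golds))
```

A line count mismatch usually means the wrong split was decoded or the gold file belongs to a different dataset version, so the helper raises instead of silently truncating.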
Example: XSum with minilm-l12-h384-uncased
Pre-processed json dataset links: text format, or tokenized format.
# path of training data
TRAIN_FILE=/your/path/to/train.json
# folder used to save fine-tuned checkpoints
OUTPUT_DIR=/your/path/to/save_checkpoints
# folder used to cache package dependencies
CACHE_DIR=/your/path/to/transformer_package_cache
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file ${TRAIN_FILE} --output_dir ${OUTPUT_DIR} \
--model_type minilm --model_name_or_path minilm-l12-h384-uncased \
--do_lower_case --fp16 --fp16_opt_level O2 --max_source_seq_length 464 --max_target_seq_length 48 \
--per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
--learning_rate 1e-4 --num_warmup_steps 500 --num_training_steps 108000 --cache_dir ${CACHE_DIR}
- The fine-tuning batch size = number of GPUs * per_gpu_train_batch_size * gradient_accumulation_steps. So in the above example, the batch size is 4 * 16 * 1 = 64. The three arguments need to be adjusted together to keep the total batch size unchanged.
- --do_lower_case: for uncased models
# path of the fine-tuned checkpoint
MODEL_PATH=/your/path/to/model_checkpoint
SPLIT=validation
# input file that you would like to decode
INPUT_JSON=/your/path/to/${SPLIT}.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
--fp16 --model_type minilm --tokenizer_name minilm-l12-h384-uncased --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
--model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
- The decoding results are saved at ${MODEL_PATH}.${SPLIT}.
- --do_lower_case: for uncased models
The golden answer text files can be downloaded here.
SPLIT=validation
GOLD_PATH=/your/path/to/${SPLIT}.target
# ${MODEL_PATH}.${SPLIT} is the predicted target file
python evaluations/eval_for_xsum.py --pred ${MODEL_PATH}.${SPLIT} --gold ${GOLD_PATH} --split ${SPLIT}
Example: CNN / Daily Mail with unilm1-base-cased
Pre-processed json dataset links: tokenized format.
# path of training data
export TRAIN_FILE=/your/path/to/train.json
# path used to cache training data
export CACHED_FEATURE_FILE=/your/path/to/cnndm_train.cased.features.pt
# folder used to save fine-tuned checkpoints
export OUTPUT_DIR=/your/path/to/save_checkpoints
# folder used to cache package dependencies
export CACHE_DIR=/your/path/to/transformer_package_cache
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file $TRAIN_FILE --cached_train_features_file $CACHED_FEATURE_FILE --output_dir $OUTPUT_DIR \
--model_type unilm --model_name_or_path unilm1-base-cased --fp16 --fp16_opt_level O2 \
--max_source_seq_length 608 --max_target_seq_length 160 --per_gpu_train_batch_size 8 --gradient_accumulation_steps 2 \
--learning_rate 7e-5 --num_warmup_steps 1000 --num_training_steps 45000 --cache_dir $CACHE_DIR --save_steps 1500
- The fine-tuning batch size = number of GPUs * per_gpu_train_batch_size * gradient_accumulation_steps. So in the above example, the batch size is 4 * 8 * 2 = 64. The three arguments need to be adjusted together to keep the total batch size unchanged.
- A fine-tuned checkpoint is provided here.
# path of the fine-tuned checkpoint
MODEL_PATH=/your/path/to/model_checkpoint
SPLIT=dev
# input file that you would like to decode
INPUT_JSON=/your/path/to/${SPLIT}.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
--fp16 --model_type unilm --tokenizer_name unilm1-base-cased --input_file ${INPUT_JSON} --split $SPLIT \
--model_path ${MODEL_PATH} --max_seq_length 768 --max_tgt_length 160 --batch_size 32 --beam_size 5 \
--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
- The decoding results are saved at ${MODEL_PATH}.${SPLIT}.
The golden answer text files can be downloaded here.
SPLIT=dev
GOLD_PATH=/your/path/to/${SPLIT}.target
# ${MODEL_PATH}.${SPLIT} is the predicted target file
python evaluations/eval_for_cnndm.py --pred ${MODEL_PATH}.${SPLIT} --gold ${GOLD_PATH} --split ${SPLIT} --trunc_len 160
Example: CNN / Daily Mail with unilm1.2-base-uncased
Pre-processed json dataset links: tokenized format.
# path of training data
export TRAIN_FILE=/your/path/to/train.json
# folder used to save fine-tuned checkpoints
export OUTPUT_DIR=/your/path/to/save_checkpoints
# folder used to cache package dependencies
export CACHE_DIR=/your/path/to/transformer_package_cache
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file $TRAIN_FILE --output_dir $OUTPUT_DIR \
--model_type unilm --model_name_or_path unilm1.2-base-uncased --do_lower_case --fp16 --fp16_opt_level O2 \
--max_source_seq_length 608 --max_target_seq_length 160 --per_gpu_train_batch_size 8 --gradient_accumulation_steps 2 \
--learning_rate 7e-5 --num_warmup_steps 1000 --num_training_steps 45000 --cache_dir $CACHE_DIR --save_steps 1500
- The fine-tuning batch size = number of GPUs * per_gpu_train_batch_size * gradient_accumulation_steps. So in the above example, the batch size is 4 * 8 * 2 = 64. The three arguments need to be adjusted together to keep the total batch size unchanged.
- --do_lower_case: for uncased models
# path of the fine-tuned checkpoint
MODEL_PATH=/your/path/to/model_checkpoint
SPLIT=dev
# input file that you would like to decode
INPUT_JSON=/your/path/to/${SPLIT}.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
--fp16 --model_type unilm --tokenizer_name unilm1.2-base-uncased --do_lower_case --input_file ${INPUT_JSON} --split $SPLIT \
--model_path ${MODEL_PATH} --max_seq_length 768 --max_tgt_length 160 --batch_size 32 --beam_size 5 \
--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --min_len 48
- The decoding results are saved at ${MODEL_PATH}.${SPLIT}.
The golden answer text files can be downloaded here.
SPLIT=dev
GOLD_PATH=/your/path/to/${SPLIT}.target
# ${MODEL_PATH}.${SPLIT} is the predicted target file
python evaluations/eval_for_cnndm.py --pred ${MODEL_PATH}.${SPLIT} --gold ${GOLD_PATH} --split ${SPLIT} --trunc_len 160
If you find this repository useful, please consider citing our work:
@article{s2s-ft,
title={s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning},
author={Hangbo Bao and Li Dong and Wenhui Wang and Nan Yang and Furu Wei},
year={2021},
eprint={2110.13640},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Example: XSum with unilm2-base-uncased
Pre-processed json dataset links: tokenized format.
# choose the model to fine-tune
export MODEL_NAME=unilm2-base-uncased
# path of training data
export TRAIN_FILE=/your/path/to/train.json
# folder used to save fine-tuned checkpoints
export OUTPUT_DIR=/your/path/to/save_checkpoints
# folder used to cache package dependencies
export CACHE_DIR=/your/path/to/transformer_package_cache
# learning rate
export LR=7e-5
# number of total training steps
export NUM_STEPS=48000
# target segment word drop prob
export TMP=0.4
# max length for source sequence
export MAX_SRC=720
# max length for target sequence
export MAX_TGT=48
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--train_file $TRAIN_FILE --output_dir $OUTPUT_DIR \
--model_type unilm --model_name_or_path $MODEL_NAME --do_lower_case \
--fp16 --fp16_opt_level O2 \
--max_source_seq_length $MAX_SRC --max_target_seq_length $MAX_TGT \
--per_gpu_train_batch_size 8 --gradient_accumulation_steps 2 \
--learning_rate $LR --num_warmup_steps 1000 --num_training_steps $NUM_STEPS \
--cache_dir $CACHE_DIR --save_steps 1500 --target_mask_prob $TMP
- The fine-tuning batch size = number of GPUs * per_gpu_train_batch_size * gradient_accumulation_steps. So in the above example, the batch size is 4 * 8 * 2 = 64. The three arguments need to be adjusted together to keep the total batch size unchanged.
- --do_lower_case: for uncased models; not needed for cased models
- --target_mask_prob: word drop probability for the target segment. For the XSum dataset, we recommend a value of 0.4 or 0.5. For the CNN/DailyMail dataset, we recommend a value of 0.7 or 0.8.
- --learning_rate: learning rate. For the base models, we recommend a value between 5e-5 and 1e-4. For the large models, we recommend a value between 1e-5 and 3e-5.
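The recommendations above can be kept next to a launch script as a small lookup, so that a configuration outside the suggested values is flagged before a long run starts. This is a sketch only: the dictionary keys ("xsum", "cnndm", "base", "large") and the helper `check_hyperparameters` are hypothetical names chosen for illustration, and the values are copied from the notes above.

```python
# Recommended values quoted from the notes above.
TARGET_MASK_PROB = {"xsum": (0.4, 0.5), "cnndm": (0.7, 0.8)}
LEARNING_RATE_RANGE = {"base": (5e-5, 1e-4), "large": (1e-5, 3e-5)}


def check_hyperparameters(dataset, model_size, target_mask_prob, learning_rate):
    """Return warnings for settings outside the recommended values."""
    warnings = []
    recommended = TARGET_MASK_PROB[dataset]
    if target_mask_prob not in recommended:
        warnings.append(
            f"target_mask_prob={target_mask_prob} is not one of {recommended} for {dataset}"
        )
    low, high = LEARNING_RATE_RANGE[model_size]
    if not low <= learning_rate <= high:
        warnings.append(
            f"learning_rate={learning_rate} is outside [{low}, {high}] for {model_size} models"
        )
    return warnings
```

An empty list means the settings match the recommendations; values outside them may still work, which is why the helper warns rather than raises.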
# tokenizer used for decoding, same as the model name
export TOKENIZER_NAME=$MODEL_NAME # or unilm2-base-uncased
# path of the fine-tuned checkpoint
export MODEL_PATH=/your/path/to/model_checkpoint
export SPLIT=dev
# input file that you would like to decode
export INPUT_JSON=/your/path/to/${SPLIT}.json
# max length for source sequence
export MAX_SRC=720
# max length for target sequence
export MAX_TGT=48
export MAX_LEN=$((MAX_SRC + MAX_TGT))
# set minimum length for decoding target sequence
export MIN_LEN=1
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
--fp16 --model_type unilm --tokenizer_name unilm2-base-uncased --do_lower_case \
--input_file ${INPUT_JSON} --split $SPLIT --model_path ${MODEL_PATH} \
--max_seq_length $MAX_LEN --max_tgt_length $MAX_TGT \
--batch_size 24 --beam_size 8 \
--length_penalty 0.9 --forbid_duplicate_ngrams --mode s2s \
--forbid_ignore_word "." --min_len $MIN_LEN
- The decoding results are saved at ${MODEL_PATH}.${SPLIT}.
- --do_lower_case: for uncased models; not needed for cased models
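In the examples in this document, the decoding --max_seq_length equals the sum of the source and target lengths used for fine-tuning (464 + 48 = 512, 608 + 160 = 768, 720 + 48 = 768). The sketch below encodes that relation; `decoding_max_seq_length` is a hypothetical helper, and treating the sum as a rule for other configurations is an assumption drawn from these examples.

```python
def decoding_max_seq_length(max_source_seq_length, max_target_seq_length):
    """Total decoding length as source length plus target length."""
    return max_source_seq_length + max_target_seq_length


# (max_source_seq_length, max_target_seq_length, --max_seq_length) taken from
# the fine-tuning and decoding commands in this document.
EXAMPLES = [(464, 48, 512), (608, 160, 768), (720, 48, 768)]

if __name__ == "__main__":
    for src_len, tgt_len, max_len in EXAMPLES:
        assert decoding_max_seq_length(src_len, tgt_len) == max_len
    print("all example lengths are consistent")
```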
The golden answer text files can be downloaded here.
SPLIT=dev
GOLD_PATH=/your/path/to/${SPLIT}.target
# ${MODEL_PATH}.${SPLIT} is the predicted target file
python evaluations/eval_for_xsum.py --pred ${MODEL_PATH}.${SPLIT} --gold ${GOLD_PATH} --split ${SPLIT}
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.