Lite Transformer with Long-Short Range Attention

@inproceedings{Wu2020LiteTransformer,
  title={Lite Transformer with Long-Short Range Attention},
  author={Zhanghao Wu* and Zhijian Liu* and Ji Lin and Yujun Lin and Song Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}

Overview

We release the PyTorch code for the Lite Transformer. [Paper|Website|Slides]

(Figure: Lite Transformer overview)

Consistent Improvement by Tradeoff Curves

(Figure: trade-off curves)

Save 20000x Search Cost Compared to the Evolved Transformer

(Figure: search cost comparison with the Evolved Transformer)

Further Compress the Transformer by 18.2x

(Figure: model compression results)

How to Use

Prerequisites

  • Python version >= 3.6
  • PyTorch version >= 1.0.0
  • configargparse >= 0.14
  • For training new models, you'll also need an NVIDIA GPU and NCCL (a quick environment check is sketched below)
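
As a quick sanity check (a minimal sketch; it only relies on the standard python, pip, and PyTorch command-line/Python interfaces), you can verify the environment before installing:

# Should print Python >= 3.6
python --version
# Should print a PyTorch version >= 1.0.0 and True if a usable GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Install configargparse if it is missing
pip install "configargparse>=0.14"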

Installation

  1. Codebase

    To install fairseq from source and develop locally:

    pip install --editable .
  2. Customized Modules

    We also need to build the lightconv and dynamicconv modules for GPU support.

    Lightconv_layer

    cd fairseq/modules/lightconv_layer
    python cuda_function_gen.py
    python setup.py install

    Dynamicconv_layer

    cd fairseq/modules/dynamicconv_layer
    python cuda_function_gen.py
    python setup.py install
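
    If both builds succeed, the compiled extensions should be importable from Python. Below is a minimal sanity check; it assumes the extension module names lightconv_cuda and dynamicconv_cuda produced by the setup scripts above:

    # Both imports should succeed without errors after the builds above
    python -c "import lightconv_cuda; import dynamicconv_cuda; print('CUDA extensions OK')"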

Data Preparation

IWSLT'14 De-En

We follow the data preparation in fairseq. To download and preprocess the data, one can run

bash configs/iwslt14.de-en/prepare.sh

WMT'14 En-Fr

We follow the data pre-processing in fairseq. To download and preprocess the data, one can run

bash configs/wmt14.en-fr/prepare.sh

WMT'16 En-De

We follow the data pre-processing in fairseq. One should first download the preprocessed data from the Google Drive folder provided by Google. To binarize the data, one can run

bash configs/wmt16.en-de/prepare.sh [path to the downloaded zip file]

WIKITEXT-103

As the language modeling task requires a fair amount of additional code, we place it in another branch: language-model. We follow the data pre-processing in fairseq. To download and preprocess the data, one can run

git checkout language-model
bash configs/wikitext-103/prepare.sh

Testing

For example, to test the models on WMT'14 En-Fr, one can run

configs/wmt14.en-fr/test.sh [path to the model checkpoints] [gpu-id] [test|valid]

For instance, to evaluate Lite Transformer on GPU 0 (reporting the BLEU score on the WMT'14 En-Fr test set), one can run

configs/wmt14.en-fr/test.sh embed496/ 0 test

We provide several pretrained models at the bottom. You can download a model and extract the archive with

tar -xzvf [filename]
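
For example, a complete evaluation workflow might look like the following (the archive name embed496.tar.gz is hypothetical; substitute the file you actually downloaded):

# Extract a downloaded checkpoint archive (hypothetical filename)
tar -xzvf embed496.tar.gz
# Evaluate the extracted checkpoint on the WMT'14 En-Fr test set using GPU 0
configs/wmt14.en-fr/test.sh embed496/ 0 test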

Training

We provide several examples for training Lite Transformer with this repo:

To train Lite Transformer on WMT'14 En-Fr (with 8 GPUs), one can run

python train.py data/binary/wmt14_en_fr --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml

To train Lite Transformer with fewer GPUs, e.g. 4 GPUs, one can run

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data/binary/wmt14_en_fr --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml --update-freq 32

In general, to train a model, one can run

python train.py [path to the data binary] --configs [path to config file] [override options]

Note that --update-freq should be adjusted according to the number of GPUs (16 for 8 GPUs, 32 for 4 GPUs) so that the effective batch size stays the same.
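
Extrapolating that rule (the product of the GPU count and --update-freq stays at 128), a 2-GPU run would presumably use --update-freq 64; this is a sketch following the pattern above, not a configuration reported in the paper:

# Hypothetical 2-GPU run: 2 x 64 = 128 keeps the effective batch size unchanged
CUDA_VISIBLE_DEVICES=0,1 python train.py data/binary/wmt14_en_fr --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml --update-freq 64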

Distributed Training (optional)

To train Lite Transformer in a distributed manner, e.g. on two GPU nodes with 16 GPUs in total, one can run

# On host1
python -m torch.distributed.launch \
        --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 \
        --master_addr=host1 --master_port=8080 \
        train.py data/binary/wmt14_en_fr \
        --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml \
        --distributed-no-spawn \
        --update-freq 8
# On host2
python -m torch.distributed.launch \
        --nproc_per_node=8 \
        --nnodes=2 --node_rank=1 \
        --master_addr=host1 --master_port=8080 \
        train.py data/binary/wmt14_en_fr \
        --configs configs/wmt14.en-fr/attention/multibranch_v2/embed496.yml \
        --distributed-no-spawn \
        --update-freq 8

Models

We provide the checkpoints for our Lite Transformer reported in the paper:

Dataset           #Mult-Adds   Test Score    Model and Test Set
WMT'14 En-Fr      90M          35.3 (BLEU)   download
WMT'14 En-Fr      360M         39.1 (BLEU)   download
WMT'14 En-Fr      527M         39.6 (BLEU)   download
WMT'16 En-De      90M          22.5 (BLEU)   download
WMT'16 En-De      360M         25.6 (BLEU)   download
WMT'16 En-De      527M         26.5 (BLEU)   download
CNN / DailyMail   800M         38.3 (R-L)    download
WIKITEXT-103      1147M        22.2 (PPL)    download
