DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)

Official PyTorch implementation of DistiLLM, as presented in our paper:

DistiLLM: Towards Streamlined Distillation for Large Language Models
Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun
KAIST AI and Microsoft

🚀 Updates

  • (24.08.12) Removed the dependency on the outdated local copy of transformers. You can now work with various recent LLMs!
  • (24.05.01) Our paper has been accepted at ICML 2024. We welcome any discussion and will reflect it in the camera-ready version. Looking forward to seeing you in Vienna!
  • (24.03.13) Released LoRA checkpoints for OpenLLaMA2-3B.

Environment

bash install.sh

Our code is based on this commit of HuggingFace Transformers, following MiniLLM.

Data

Resources

  • The training/evaluation instruction-response data before processing can be downloaded from this link.
  • The plain-text corpus D^PT can be downloaded from the HuggingFace datasets repository (see the sketch below).
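
For reference, the minimal sketch below pulls OpenWebText with the HuggingFace datasets library; the dataset id used here is our assumption, and the repository's actual download/processing logic lives in tools/get_openwebtext.py (see below).

from datasets import load_dataset

# Hedged sketch: the dataset id "Skylion007/openwebtext" is an assumption,
# not taken from this repository; substitute the dataset referenced above.
openwebtext = load_dataset("Skylion007/openwebtext", split="train")
print(openwebtext[0]["text"][:200])  # peek at the first raw document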

Data Processing

Get the plain-text corpus D^PT:

python3 tools/get_openwebtext.py

This script replaces consecutive \n characters in each document with a special token "<@x(x!>" and writes each OpenWebText document on a single line, which is convenient for parallel processing. In data/openwebtext/data.txt, we give an example of the resulting format. You can follow this format to prepare corpora other than OpenWebText.
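
As a rough illustration of that format (this is not the actual tools/get_openwebtext.py), the snippet below collapses newline runs into the special token and writes one document per line to a toy file:

import re

SPECIAL_TOKEN = "<@x(x!>"  # stands in for runs of consecutive newlines inside a document

def flatten_document(doc: str) -> str:
    # Replace runs of \n with the special token so the whole document fits on one line.
    return re.sub(r"\n+", SPECIAL_TOKEN, doc.strip())

docs = [
    "First paragraph.\n\nSecond paragraph.",
    "Another document.\nWith a single line break.",
]

with open("data_example.txt", "w") as f:
    for doc in docs:
        # One flattened document per line, mirroring data/openwebtext/data.txt.
        f.write(flatten_document(doc) + "\n")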

Tokenize the data and store them in binary files:

bash scripts/gpt2/tools/process_data_dolly.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process Dolly Train / Validation Data
bash scripts/gpt2/tools/process_data_pretrain.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process OpenWebText Train / Validation Data

bash scripts/opt/tools/process_data_dolly.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process Dolly Train / Validation Data
bash scripts/opt/tools/process_data_pretrain.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process OpenWebText Corpus Train / Validation Data

bash scripts/llama/tools/process_data_dolly.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process Dolly Train / Validation Data
bash scripts/llama/tools/process_data_pretrain.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM} # Process OpenWebText Corpus Train / Validation Data

Base Pre-trained Models

To run fine-tuning or standard KD baselines, you need to download the model checkpoints from the HuggingFace Model Hub and put them in checkpoints/. For example, for gpt2-large, you can download the model from this link and put it in checkpoints/gpt2-large.
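
If you prefer to do this in Python rather than downloading manually, a minimal sketch using transformers (not part of the provided scripts) is:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"
save_dir = f"checkpoints/{model_name}"

# Download from the HuggingFace Hub and store the files where the scripts expect them.
AutoModelForCausalLM.from_pretrained(model_name).save_pretrained(save_dir)
AutoTokenizer.from_pretrained(model_name).save_pretrained(save_dir)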

Alternatively, you can change the CKPT variable in each script to the corresponding model name so that Transformers downloads the base models automatically. For example, setting CKPT="gpt2-large" in scripts/gpt2/sft/sft_large.sh downloads the gpt2-large base model from the HuggingFace model hub.

Train

We provide example commands for GPT-2 models. Similar scripts for the other model families can be found in scripts/opt and scripts/openllama2. All our experiments are conducted on 4 x A100 (40GB) GPUs; the number of GPUs can be reduced for small models.

Baselines

The final checkpoints are selected by the ROUGE-L scores.
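
For reference, ROUGE-L can be computed with the rouge_score package as in the minimal sketch below; this only illustrates the selection metric and is not the repository's evaluation code.

from rouge_score import rouge_scorer

# Score a generated response against a reference; checkpoints are compared by this metric.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "Brew the tea for three minutes before serving.",   # reference response
    "Let the tea brew for three minutes, then serve.",  # model response
)
print(scores["rougeL"].fmeasure)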

Fine-tune the teacher models

bash scripts/gpt2/sft/sft_xlarge.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

SFT Baselines

bash scripts/gpt2/sft/sft_base.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/sft/sft_medium.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/sft/sft_large.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

KD Baselines

bash scripts/gpt2/kd/kd_base.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/kd/kd_medium.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/kd/kd_large.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

SeqKD Baselines

Generate and process responses with the teacher:

bash scripts/gpt2/tools/generate_data_seqkd.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/tools/process_pseudo_data_seqkd.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

Fine-tune the model with SeqKD:

bash scripts/gpt2/seqkd/seqkd_base.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/seqkd/seqkd_medium.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/seqkd/seqkd_large.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

Student Initialization

The final checkpoints are selected by the validation loss.

bash scripts/gpt2/init/init_base.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/init/init_medium.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/init/init_large.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

ImitKD Baselines

bash scripts/gpt2/imitkd/imitkd_base_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/imitkd/imitkd_medium_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/imitkd/imitkd_large_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

MiniLLM Baselines

bash scripts/gpt2/minillm/train_base_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/minillm/train_medium_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/minillm/train_large_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

GKD Baselines

bash scripts/gpt2/gkd/gkd_base_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/gkd/gkd_medium_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/gkd/gkd_large_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

DistiLLM

The final checkpoints are selected by the validation loss.

bash scripts/gpt2/init/init_base.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/init/init_medium.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/init/init_large.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

The final checkpoints are selected by the ROUGE-L scores.

bash scripts/gpt2/distillm/train_base_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/distillm/train_medium_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}
bash scripts/gpt2/distillm/train_large_xl.sh ${/PATH/TO/DistiLLM} ${MASTER_PORT} ${GPU_NUM}

Run Evaluation

bash scripts/gpt2/eval/run_eval.sh ${GPU_IDX} ${/PATH/TO/DistiLLM}
bash scripts/opt/eval/run_eval.sh ${GPU_IDX} ${/PATH/TO/DistiLLM} 
bash scripts/openllama2/eval/run_eval.sh ${GPU_IDX} ${/PATH/TO/DistiLLM} 

Results

DistiLLM outperforms other KD baselines in terms of both generation performance and training speed for various model families such as GPT-2, OPT, and OpenLLaMA.

Checkpoints (OpenLLaMA-3B)

We share the LoRA weights for OpenLLaMA-3B on Google Drive.
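
A minimal sketch of loading these LoRA weights with peft; the base-model id and the local path below are placeholders/assumptions, not fixed by this repository:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Assumption: the base model is the OpenLLaMA v2 3B release on the HuggingFace Hub.
base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b_v2")
# Placeholder path: point this at the directory where the downloaded LoRA weights were unpacked.
model = PeftModel.from_pretrained(base, "checkpoints/openllama2-3b-distillm-lora")
model = model.merge_and_unload()  # optionally fold the LoRA weights into the base model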

Acknowledgement

Our code is based on the code of the ICLR 2024 paper MiniLLM: Knowledge Distillation of Large Language Models.

Star History

[Star history chart]

BibTeX

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{kodistillm,
  title={DistiLLM: Towards Streamlined Distillation for Large Language Models},
  author={Ko, Jongwoo and Kim, Sungnyun and Chen, Tianyi and Yun, Se-Young},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}

Contact
