Official code for our paper: Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Can we prune our MLLM just once instead?
Compared with text information, visual information is much sparser, so it is unnecessary to use all parameters of the MLLM for visual-related computation.
Neighbor-aware visual attention computation: Only spatially neighboring visual tokens are involved in the attention computation (a minimal sketch follows below).
Non-active visual attention dropping: The ratio of attention weights between visual and text tokens can be used to evaluate the importance of each attention head for visual computation, thus helping to prune lazy heads.
Sparse visual projection: Thanks to the sparse visual representation, most neurons can be dropped from the FFN's visual computation.
Layer drop for visual computation: Visual-related computation is stopped in the last several layers.
Please refer to our paper for more details.
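As an illustration of the neighbor-aware strategy, here is a minimal sketch under illustrative assumptions: each visual token only keeps attention edges to visual tokens inside a small spatial window, and everything else is left untouched. The grid size and window radius below are placeholders, not the settings used in the paper.

import torch

def neighbor_visual_attention_mask(grid_h, grid_w, radius=1):
    # Boolean [N, N] mask over N = grid_h * grid_w visual tokens (True = keep the attention edge).
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()        # row / column index of every visual token
    dy = (ys[:, None] - ys[None, :]).abs()     # pairwise row distances
    dx = (xs[:, None] - xs[None, :]).abs()     # pairwise column distances
    return (dy <= radius) & (dx <= radius)     # keep only spatial neighbors

mask = neighbor_visual_attention_mask(24, 24, radius=1)   # e.g. a 24x24 = 576-token visual grid
print(mask.shape, mask.float().mean().item())             # fraction of visual-visual attention kept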
✅ pruning code for LLaVA
✅ checkpoints of pruned LLaVA
✅ pruning code for Qwen2-VL
✅ pruning code for InternVL
The currently released pruning code simulates pruning through masking operations. Code that integrates the proposed strategies into the KV-cache computation will be released soon.
- Set up LLaVA: https://github.com/haotian-liu/LLaVA
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install transformers==4.36.2
Be sure to run the last step to downgrade transformers; otherwise there will be issues during inference.
- Copy our updated modeling_llama.py to the transformers library
cp ../modeling_llama_prune.py {YOUR ENV PATH}/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py
# e.g. cp ../modeling_llama_prune.py /opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py
- Download the checkpoints of pruned LLaVA
- Run inference
bash LLaVA/infer.sh
- Download and set up the LLaVA-1.5 2nd stage training data: https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md
- Download the LLaVA-1.5 mm_projector weights
https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
and put them into ./checkpoints/llava-v1.5-13b-pretrain and ./checkpoints/llava-v1.5-7b-pretrain, respectively.
- Run training
bash scripts/v1_5/finetune_yopo.sh
- We evaluated our model on multiple visual question-answering and reasoning benchmarks, including VQAv2, GQA, ScienceQA, TextVQA, POPE, MME, and MMBench.
- For evaluation, you can use either the LLaVA evaluation scripts or lmms-eval.
- Please follow the instructions in InternVL-2 to install InternVL: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/
- For convenience, we provide the pruned models here (note that these models have the same weights as the original ones; the inference code with the pruning strategies is included in the corresponding repos):
https://huggingface.co/zwt123home123/InternVL2-4B-YOPO
https://huggingface.co/zwt123home123/InternVL2-8B-YOPO
https://huggingface.co/zwt123home123/InternVL2-26B-YOPO
What do we change in the inference code?
https://huggingface.co/zwt123home123/InternVL2-26B-YOPO/blob/main/modeling_internlm2.py#L311-L317
https://huggingface.co/zwt123home123/InternVL2-26B-YOPO/blob/main/modeling_internlm2.py#L446-L470
- For evaluation, load the model as follows:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).eval().cuda()
Be sure to set use_flash_attn=False
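After loading, inference can follow the standard InternVL2 chat interface from its model card. The snippet below is a hedged usage sketch: load_image is the image-preprocessing helper defined in the InternVL2 model card (not reproduced here), and the prompt and generation settings are illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
pixel_values = load_image("./examples/image1.jpg", max_num=12).to(torch.bfloat16).cuda()  # load_image: helper from the InternVL2 model card
generation_config = dict(max_new_tokens=1024, do_sample=False)
question = "<image>\nPlease describe the image shortly."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)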
- Please follow the instructions for Qwen2-VL-7B-Instruct to install Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
- Copy our updated modeling_qwen2_vl.py to the transformers library
cp modeling_qwen2_vl.py {YOUR ENV PATH}/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py
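Once the patched file is in place, loading Qwen2-VL through the standard transformers API should pick up the pruning code. The sketch below is an assumption on our side; in particular, attn_implementation="eager" mirrors the use_flash_attn=False requirement for InternVL above, since fused attention kernels may bypass the masking logic.

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",   # assumption: keep eager attention so masking-based pruning is applied
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")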
We provide the code to compute the headmask for pruning non-active visual attention heads in gen_mask_llava.py and gen_mask_internvl_qwen.py.
The headmasks for different models can be found at https://drive.google.com/drive/folders/17xPC4pPTs-7WQDoRvjVu1ZYRa7y7GAZE?usp=sharing
You can generate the headmask for your own model on your own calibration dataset, referring to the code here: https://huggingface.co/zwt123home123/InternVL2-8B-YOPO/blob/main/modeling_internlm2.py#L462-L471
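For reference, here is a minimal sketch of how such a headmask could be derived from a calibration set: for each head, compare the attention mass that text queries place on visual keys against the mass they place on text keys, then keep only the heads with the highest visual ratio. The shapes, averaging scheme, and keep fraction are illustrative assumptions, not the exact recipe in the linked code.

import torch

def head_visual_ratio(attn, vis_idx, txt_idx):
    # attn: [num_heads, seq, seq] attention weights for one calibration sample.
    vis_mass = attn[:, txt_idx][:, :, vis_idx].sum(dim=(1, 2))   # text queries -> visual keys
    txt_mass = attn[:, txt_idx][:, :, txt_idx].sum(dim=(1, 2))   # text queries -> text keys
    return vis_mass / (txt_mass + 1e-6)                          # per-head visual attention ratio

def build_head_mask(ratios_per_sample, keep_fraction=0.5):
    # Average ratios over the calibration set and keep the top heads for visual computation.
    ratios = torch.stack(ratios_per_sample).mean(dim=0)
    keep = torch.zeros_like(ratios, dtype=torch.bool)
    keep[ratios.topk(int(keep_fraction * ratios.numel())).indices] = True
    return keep                                                   # True = head stays active for visual tokens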
This project is released under the MIT license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.
If you find the idea or code useful for your research, please consider citing our paper:
@article{zhang2024treat,
title={Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See},
author={Zhang, Zeliang and Pham, Phu and Zhao, Wentian and Wan, Kun and Li, Yu-Jhe and Zhou, Jianing and Miranda, Daniel and Kale, Ajinkya and Xu, Chenliang},
journal={arXiv preprint arXiv:2410.06169},
year={2024}
}
Questions and suggestions can be sent to [email protected] and {wezhao, kuwan}@adobe.com.