Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang
- [2024/09] MG-LLaVA inference code released! Please refer to Inference for more details.
- [2024/08] MG-LLaVA now supports the evaluation of MMVet, LLaVA-Bench-in-the-wild, MMVP, and MathVista benchmarks!
- [2024/06] Our paper, code and weights are all released.
We present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We integrate an additional high-resolution visual encoder to capture fine-grained details, which are fused with the base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills.
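The actual fusion module is defined in the released code; purely as an illustrative sketch under our own assumptions (the class, function, and parameter names below are not taken from the repository), a gated convolutional fusion of the low- and high-resolution feature streams could look roughly like this:

```python
import torch
import torch.nn as nn

class ConvGateFusion(nn.Module):
    """Illustrative sketch only (not the repository's implementation): fuse
    low-resolution base features with high-resolution features via a learned gate."""

    def __init__(self, low_dim: int, high_dim: int, out_dim: int):
        super().__init__()
        self.low_proj = nn.Conv2d(low_dim, out_dim, kernel_size=1)
        self.high_proj = nn.Conv2d(high_dim, out_dim, kernel_size=1)
        # The gate decides, per spatial location, how much fine-grained detail to inject.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * out_dim, out_dim, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat: (B, low_dim, H, W) base visual features
        # high_feat: (B, high_dim, H', W') fine-grained features, resampled to match
        high_feat = nn.functional.interpolate(
            high_feat, size=low_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        low = self.low_proj(low_feat)
        high = self.high_proj(high_feat)
        gate = self.gate(torch.cat([low, high], dim=1))
        return low + gate * high
```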
- It is recommended to build a Python 3.10 virtual environment using conda:
conda create --name mgllava-env python=3.10 -y
conda activate mgllava-env
- Install XTuner from source:
git clone https://github.com/PhoenixZ810/MG-LLaVA.git
cd MG-LLaVA
pip install -e '.[all]'
Please refer to dataset_prepare.md.
Our checkpoints are available at ModelZoo.
MG-LLaVA employs several LLMs ranging from 3.8B to 34B parameters, including Phi-3-3.8B, Vicuna1.5-7B, Vicuna1.5-13B, llama3-8B, and Yi1.5-34B. We use CLIP-Large-336 and CLIP-ConvNext-320-d as vision encoders; you should download both the LLM and CLIP checkpoints before training.
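If you fetch the checkpoints from the Hugging Face Hub, a download script could look like the sketch below. The repo IDs and local paths are assumptions for illustration only; substitute the exact LLM and CLIP checkpoints you intend to train with.

```python
from huggingface_hub import snapshot_download

# Repo IDs below are illustrative assumptions, not values taken from this repository.
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5",
                  local_dir="checkpoints/vicuna-7b-v1.5")
snapshot_download(repo_id="openai/clip-vit-large-patch14-336",
                  local_dir="checkpoints/clip-vit-large-patch14-336")
snapshot_download(repo_id="laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup",
                  local_dir="checkpoints/clip-convnext-320d")
```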
The training process is similar to the original XTuner. Before training, check the configs and modify the following variables to match your own paths and settings (an illustrative example follows the list below):
# Path of LLM and CLIP
llm_name_or_path
visual_encoder_name_or_path
visual_encoder_aux_path
prompt_template
# Data
data_path
box_json_path
image_folder
offline_processed_text_folder (optional)
# Training
pretrained_pth (fine-tuning stage only)
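For orientation only, the snippet below shows roughly how these variables might be set in a Vicuna1.5-7B config; every path here is a placeholder, not a value shipped with the repository.

```python
from xtuner.utils import PROMPT_TEMPLATE

# Path of LLM and CLIP (all paths below are placeholders)
llm_name_or_path = 'checkpoints/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'checkpoints/clip-vit-large-patch14-336'
visual_encoder_aux_path = 'checkpoints/clip-convnext-320d'
prompt_template = PROMPT_TEMPLATE.vicuna

# Data
data_path = 'data/train_sft.json'
box_json_path = 'data/train_sft_boxes.json'
image_folder = 'data/images'
offline_processed_text_folder = 'data/text_cache/vicuna7b_sft'  # optional

# Training (fine-tuning stage only)
pretrained_pth = 'work_dirs/pretrain/iter_xxxx.pth'  # checkpoint from the pretraining stage
```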
Before training, you can preprocess the text data to speed up the training process by running the following command:
python xtuner/tools/process_untokenized_llava_data.py CONFIG --save-folder TEXT-PATH
and then set offline_processed_text_folder in the config file to TEXT-PATH.
MG-LLaVA follows a two-stage training process; with the Vicuna1.5-7B model on 8×A100 GPUs, the entire process takes approximately 23 hours. For example, to train the MG-LLaVA model with Vicuna1.5-7B, you can use the following command:
- Entire Pipeline: Pretraining + Fine-tuning + Evaluation
bash script/train_vicuna7B.sh
If you want to train our model step by step, you can follow the instructions below:
- Step 1, start pretraining.
bash script/train_pretrain.sh mg_llava/config/vicuna/fuse_vicuna7b_clip_L_14_336_pretrain_padding.py
- Step 2, start fine-tuning.
bash script/train_sft.sh mg_llava/config/vicuna/fuse_vicuna7b_clip_L_14_336_sft_padding.py
- --deepspeed means using DeepSpeed 🚀 to optimize the training. XTuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.
- For more examples, please see finetune.md.
- Step 3, evaluation. The evaluation metrics are specified in the SFT configuration, including MMBench, SEED, SQA, AI2D, TextVQA, POPE, GQA, VQAv2, and others. Please refer to evaluation.md.
You can convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model by running:
xtuner convert pth_to_hf CONFIG_NAME_OR_PATH CHECKPOINT SAVE_PATH
Before inference, you need to download the MG-LLaVA checkpoints and the corresponding LLM model. In addition, CLIP-Large-336, CLIP-ConvNext-320-d, RAM, and OWL-VIT-2 are also required.
The inference code is available in chat.py. You can use the following command (also provided in chat.sh) to chat with MG-LLaVA.
srun -p mllm_1 \
--gres=gpu:1 \
python mg_llava/module/chat.py \
PATH TO MG-LLaVA-Vicuna-7B MODEL \
--llm_name_or_path 'PATH TO Vicuna1.5-7B LLM' \
--visual_encoder_clip 'PATH TO CLIP MODEL' \
--visual_encoder_convnext 'PATH TO ConvNext MODEL' \
--ram_model 'PATH TO RAM MODEL' \
--owl_vit_model 'PATH TO OWL-VIT-2 MODEL' \
--prompt-template 'vicuna' \
--image examples/example.jpg
If you find MG-LLaVA useful, please cite using this BibTeX:
@article{zhao2024mg,
title={MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning},
author={Zhao, Xiangyu and Li, Xiangtai and Duan, Haodong and Huang, Haian and Li, Yining and Chen, Kai and Yang, Hua},
journal={arXiv preprint arXiv:2406.17770},
year={2024}
}