Deyao Zhu* (On Job Market!), Jun Chen* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution
King Abdullah University of Science and Technology
Click the image to chat with MiniGPT-4 about your images
More examples can be found on the project page.
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- The training of MiniGPT-4 consists of a first pretraining stage using roughly 5 million aligned image-text pairs for 10 hours on 4 A100s, and a second finetuning stage using an additional 3,500 carefully curated high-quality pairs for 7 minutes on 1 A100.
- MiniGPT-4 possesses many emerging vision-language capabilities similar to those exhibited by GPT-4.
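To make the "one projection layer" idea concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: the sizes, the stand-in inputs, and the helper names `make_projection` and `project` are all hypothetical, chosen only to show how a single linear map can bring visual tokens to the width of the language model's embeddings while both large models stay frozen.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create the weight matrix and bias of one linear layer.

    In this sketch, these are the only parameters that would be trained.
    """
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual token from the encoder width to the LLM embedding width."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


# Toy sizes for illustration only (assumption: real models are far wider).
VISION_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 4, 6, 3

# Stand-ins for the outputs of the frozen visual encoder and the frozen LLM's
# token embeddings; neither would receive gradient updates.
visual_tokens = [[1.0] * VISION_DIM for _ in range(NUM_VISUAL_TOKENS)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

weight, bias = make_projection(VISION_DIM, LLM_DIM)
projected = project(visual_tokens, weight, bias)

# The projected visual tokens now share the LLM's width, so they can be
# placed in front of the text tokens as one input sequence.
llm_input = projected + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

Because only `weight` and `bias` would change during training, the number of trainable parameters stays small compared with the two frozen models.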
1. Prepare the code and the environment
Clone our repository, create a Python environment, and activate it with the following commands:
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
2. Prepare the pretrained Vicuna weights
The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to their instructions here to obtain the weights. The final weights should be in a single folder with the following structure:
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
Then, set the path to the Vicuna weights in the model config file here at Line 16.
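Before editing the config, it can help to confirm that the weights folder looks like the listing above. The following is a small sketch, not part of the repository: the helper names `missing_weight_files` and `has_weight_shards` are hypothetical, and because the listing is elided ("..."), the check covers only the file names that are actually shown.

```python
from pathlib import Path

# File names copied from the folder listing; the listing is elided,
# so this is a partial check rather than a full inventory.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "pytorch_model.bin.index.json",
]


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def has_weight_shards(weights_dir):
    """Return True if at least one pytorch_model-*.bin shard is present."""
    return any(Path(weights_dir).glob("pytorch_model-*.bin"))


# "vicuna_weights" is the folder name used in the listing; adjust as needed.
folder = "vicuna_weights"
print("missing files:", missing_weight_files(folder))
print("shards found:", has_weight_shards(folder))
```

An empty "missing files" list together with "shards found: True" suggests the folder matches the visible part of the expected layout.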
3. Prepare the pretrained MiniGPT-4 checkpoint
To play with our pretrained model, download the pretrained checkpoint here. Then, set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigpt4_eval.yaml at Line 10.
Try out our demo demo.py on your local machine by running:
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
The training of MiniGPT-4 contains two alignment stages.
1. First pretraining stage
In the first pretraining stage, the model is trained using image-text pairs from the Laion and CC datasets to align the vision and language models. To download and prepare the datasets, please check our first stage dataset preparation instruction. After the first stage, the visual features are mapped into a form that the language model can understand. To launch the first stage training, run the following command. In our experiments, we use 4 A100s. You can change the save path in the config file train_configs/minigpt4_stage1_pretrain.yaml.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
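The NUM_GPU placeholder in the command above has to be replaced with an actual GPU count. As a minimal sketch, assuming you prefer to assemble the command from a script, the hypothetical helper `build_train_command` below fills in the placeholder and prints the resulting command line; it uses only the flags that appear in the command above.

```python
import shlex


def build_train_command(num_gpu, cfg_path):
    """Build the torchrun invocation as an argument list."""
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    return [
        "torchrun",
        "--nproc-per-node", str(num_gpu),
        "train.py",
        "--cfg-path", cfg_path,
    ]


# 4 GPUs matches the setup reported for the first stage.
cmd = build_train_command(4, "train_configs/minigpt4_stage1_pretrain.yaml")
print(shlex.join(cmd))

# To actually launch training from the repository root, one option is:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if the config path ever contains spaces.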
2. Second finetuning stage
In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves and convert it to a conversation format to further align MiniGPT-4. To download and prepare our second stage dataset, please check our second stage dataset preparation instruction. To launch the second stage alignment, first specify the path to the checkpoint file trained in stage 1 in train_configs/minigpt4_stage2_finetune.yaml. You can also specify the output path there. Then, run the following command. In our experiments, we use 1 A100.
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly way.
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
@misc{zhu2022minigpt4,
title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
year={2023},
}
This repository is under the BSD 3-Clause License. Much of the code is based on Lavis, which is released under the BSD 3-Clause License here.