Forked from Vision-CAIR/MiniGPT-4

Initial commit f1a33af by Deyao Zhu, committed Apr 16, 2023 — 111 changed files with 8,792 additions and 0 deletions.
**LICENSE.md**

BSD 3-Clause License

Copyright 2023 Deyao Zhu
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
**LICENSE_Lavis.md**

BSD 3-Clause License

Copyright (c) 2022 Salesforce, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), Xiang Li, and Mohamed Elhoseiny. *Equal Contribution

**King Abdullah University of Science and Technology**

<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='MiniGPT_4.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>

## Online Demo

Click the image to chat with MiniGPT-4 about your images.
[![demo](figs/online_demo.png)](https://minigpt-4.github.io)
## Examples
|   |   |
:-------------------------:|:-------------------------:
![find wild](figs/examples/wop_2.png) | ![write story](figs/examples/ad_2.png)
![solve problem](figs/examples/fix_1.png) | ![write Poem](figs/examples/rhyme_1.png)

More examples can be found on the [project page](https://minigpt-4.github.io).
## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- MiniGPT-4 is trained in two stages: a first pretraining stage on roughly 5 million aligned image-text pairs (10 hours on 4 A100s), followed by a second finetuning stage on an additional 3,500 carefully curated high-quality pairs (7 minutes on a single A100).
- MiniGPT-4 possesses many emerging vision-language capabilities similar to those exhibited by GPT-4.
![overview](figs/overview.png)
## Getting Started
### Installation

**1. Prepare the code and the environment**

Clone our repository, create a Python environment, and activate it with the following commands:

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
**2. Prepare the pretrained Vicuna weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to the instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtain the weights.
The final weights should sit in a single folder with the following structure:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
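For reference, the relevant line of the model config would then point at that folder. This is an illustrative fragment with a placeholder path; the key name `llama_model` follows the shipped `minigpt4.yaml` and should be verified against your copy:

```yaml
# minigpt4/configs/models/minigpt4.yaml (excerpt, around Line 16)
llama_model: "/path/to/vicuna_weights/"
```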
**3. Prepare the pretrained MiniGPT-4 checkpoint**

To play with our pretrained model, download the pretrained checkpoint
[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
Then, set the path to the pretrained checkpoint in the evaluation config file
[eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 10.
### Launching Demo Locally

Try out our demo [demo.py](demo.py) on your local machine by running

```bash
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
```
### Training
The training of MiniGPT-4 contains two alignment stages.

**1. First pretraining stage**

In the first pretraining stage, the model is trained on image-text pairs from the LAION and CC datasets
to align the vision and language models. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped into a space the language model can understand.
To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml).

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```
**2. Second finetuning stage**

In the second stage, we use a small, high-quality image-text pair dataset created by ourselves
and convert it to a conversation format to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
Then, run the following command. In our experiments, we use 1 A100.

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```

After the second stage alignment, MiniGPT-4 is able to talk about an image coherently and in a user-friendly way.
## Acknowledgement

+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
+ [Vicuna](https://github.com/lm-sys/FastChat)

If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
}
```

## License
This repository is under the [BSD 3-Clause License](LICENSE.md).
Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS), which is under the
BSD 3-Clause License [here](LICENSE_Lavis.md).
## Download the filtered Conceptual Captions, SBU, LAION datasets

### Pre-training datasets download:
We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).

It requires ~2.3 TB to store the LAION and CC3M+CC12M+SBU datasets.

Image source | Filtered synthetic caption by ViT-L
--- | :---:
CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
LAION115M | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>

This will download two JSON files:
```
ccs_synthetic_filtered_large.json
laion_synthetic_filtered_large.json
```
## Prepare the data step by step

### Set up the dataset folder and move the annotation files to the data storage folder
```
export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
mkdir ${MINIGPT4_DATASET}/cc_sbu
mkdir ${MINIGPT4_DATASET}/laion
mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
```

### Copy the conversion scripts to the data storage folder
```
cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
cp convert_laion.py ${MINIGPT4_DATASET}/laion
cp download_laion.sh ${MINIGPT4_DATASET}/laion
```
### Convert the laion and cc_sbu annotation files to the img2dataset format
```
cd ${MINIGPT4_DATASET}/cc_sbu
python convert_cc_sbu.py
cd ${MINIGPT4_DATASET}/laion
python convert_laion.py
```

### Download the datasets with img2dataset
```
cd ${MINIGPT4_DATASET}/cc_sbu
sh download_cc_sbu.sh
cd ${MINIGPT4_DATASET}/laion
sh download_laion.sh
```
The final dataset structure:

```
.
├── ${MINIGPT4_DATASET}
│   ├── cc_sbu
│       ├── convert_cc_sbu.py
│       ├── download_cc_sbu.sh
│       ├── ccs_synthetic_filtered_large.json
│       ├── ccs_synthetic_filtered_large.tsv
│       └── cc_sbu_dataset
│           ├── 00000.tar
│           ├── 00000.parquet
│           ...
│   ├── laion
│       ├── convert_laion.py
│       ├── download_laion.sh
│       ├── laion_synthetic_filtered_large.json
│       ├── laion_synthetic_filtered_large.tsv
│       └── laion_dataset
│           ├── 00000.tar
│           ├── 00000.parquet
│           ...
...
```
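Each `.tar` shard above is a plain webdataset archive (one image, caption, and metadata file per sample), so the standard-library `tarfile` module is enough for a quick inspection. This is a sketch under the assumption that shard members follow img2dataset's `<key>.jpg` / `<key>.txt` / `<key>.json` naming:

```python
import tarfile
from collections import Counter

def shard_summary(path: str) -> Counter:
    """Count the member files in a webdataset shard, grouped by extension."""
    with tarfile.open(path) as tar:
        return Counter(name.rsplit('.', 1)[-1] for name in tar.getnames())
```

For a healthy shard, the `jpg`, `txt`, and `json` counts should match, since every sample contributes one of each.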
## Set up the dataset configuration files

Then, set up the LAION dataset loading path
[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 as
${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar

and the Conceptual Caption and SBU datasets loading path
[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 as
${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
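The `{00000..01255}.tar` pattern above is shell-style brace expansion over zero-padded shard indices. As a sanity check, the equivalent shard list can be generated in Python:

```python
# shard names that the {00000..01255}.tar brace pattern expands to
shards = [f"{i:05d}.tar" for i in range(1256)]
print(shards[0], shards[-1], len(shards))  # 00000.tar 01255.tar 1256
```

The upper bound of the range should match the highest shard index img2dataset actually produced on your machine.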
## Second Stage Data Preparation

Our second stage dataset can be downloaded from
[here](https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view?usp=share_link).
After extraction, you will get a data folder with the following structure:

```
cc_sbu_align
├── filter_cap.json
└── image
    ├── 2.jpg
    ├── 3.jpg
    ...
```

Put the folder at any path you want.
Then, set up the dataset path in the dataset config file
[here](../minigpt4/configs/datasets/cc_sbu/align.yaml#L5) at Line 5.
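A quick way to verify the extraction before editing the config is to check for the two expected entries. This is a minimal sketch; `check_align_dataset` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

def check_align_dataset(root: str) -> bool:
    """Return True if the extracted cc_sbu_align folder has the expected layout."""
    p = Path(root)
    return (p / "filter_cap.json").is_file() and (p / "image").is_dir()
```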
**convert_cc_sbu.py**

```python
import json
import csv

# specify input and output file paths
input_file = 'ccs_synthetic_filtered_large.json'
output_file = 'ccs_synthetic_filtered_large.tsv'

# load JSON data from input file
with open(input_file, 'r') as f:
    data = json.load(f)

# extract header and data from JSON
header = data[0].keys()
rows = [x.values() for x in data]

# write data to TSV file
with open(output_file, 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header)
    writer.writerows(rows)
```
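The conversion can be sketched end to end on an in-memory sample. The two records below are hypothetical, assuming the BLIP annotation format of one `url` and one `caption` per entry; the header row comes from the keys of the first record, exactly as in the script above:

```python
import csv
import io

# hypothetical two-record annotation list in the BLIP format
sample = [
    {"url": "http://example.com/a.jpg", "caption": "a cat"},
    {"url": "http://example.com/b.jpg", "caption": "a dog"},
]

header = sample[0].keys()
rows = [x.values() for x in sample]

# write the TSV to an in-memory buffer instead of a file
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(header)
writer.writerows(rows)

# header row: "url" and "caption" separated by a tab
print(buf.getvalue().splitlines()[0])
```

The resulting `url` and `caption` columns are the ones the download scripts pass to img2dataset via `--url_col` and `--caption_col`.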
**convert_laion.py**

```python
import json
import csv

# specify input and output file paths
input_file = 'laion_synthetic_filtered_large.json'
output_file = 'laion_synthetic_filtered_large.tsv'

# load JSON data from input file
with open(input_file, 'r') as f:
    data = json.load(f)

# extract header and data from JSON
header = data[0].keys()
rows = [x.values() for x in data]

# write data to TSV file
with open(output_file, 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(header)
    writer.writerows(rows)
```
**download_cc_sbu.sh**

```bash
#!/bin/bash

img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 256 \
    --enable_wandb True
```
**download_laion.sh**

```bash
#!/bin/bash

img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 256 \
    --enable_wandb True
```