
Commit

include llama2
TsuTikgiau committed Aug 28, 2023
1 parent bbd7883 commit fb8e2c6
Showing 15 changed files with 412 additions and 121 deletions.
49 changes: 26 additions & 23 deletions README.md
@@ -1,13 +1,13 @@
# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), [Xiang Li](https://xiangli.ac.cn), and [Mohamed Elhoseiny](https://www.mohamed-elhoseiny.com/). *Equal Contribution
[Deyao Zhu](https://tsutikgiau.github.io/)* , [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), [Xiang Li](https://xiangli.ac.cn), and [Mohamed Elhoseiny](https://www.mohamed-elhoseiny.com/). *Equal Contribution

**King Abdullah University of Science and Technology**

<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2304.10592'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/Vision-CAIR/minigpt4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a> <a href='https://huggingface.co/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a> [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://www.youtube.com/watch?v=__tftoxpBAw&feature=youtu.be)


## News
We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! The demo GPU memory consumption now can be as low as 12GB.
We now provide a Llama 2 version of MiniGPT-4!


## Online Demo
@@ -52,49 +52,52 @@ conda activate minigpt4
```


**2. Prepare the pretrained Vicuna weights**
**2. Prepare the pretrained LLM weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to our instruction [here](PrepareVicuna.md)
to prepare the Vicuna weights.
The final weights would be in a single folder in a structure similar to the following:
Currently, we provide both Vicuna V0 and Llama 2 versions of MiniGPT-4.
Download the corresponding LLM weights from one of the Hugging Face repositories below by cloning the repository with git-lfs (or see the download sketch after the table).

| Vicuna V0 13B | Vicuna V0 7B | Llama 2 Chat 7B |
|:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:|
| [Download](https://huggingface.co/Vision-CAIR/vicuna/tree/main) | [Download](https://huggingface.co/Vision-CAIR/vicuna-7b/tree/main) | [Download](https://huggingface.co/meta-llama/Llama-2-7b-chat/tree/main) |
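
If you prefer a scripted download over a git-lfs clone, the snippet below is a minimal sketch using the `huggingface_hub` package (an assumption on our part: `pip install huggingface_hub`; the target folder name is illustrative). Note that the Llama 2 repository is gated, so you first need to accept Meta's license on Hugging Face and log in with `huggingface-cli login`.

```python
# Minimal sketch: download one of the LLM weight repos listed in the table above.
# Assumes `pip install huggingface_hub`; for the gated Llama 2 repo you must have
# accepted the license and be logged in to Hugging Face.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Vision-CAIR/vicuna-7b",   # or the Llama 2 repo linked above
    local_dir="weights/vicuna-7b",     # illustrative target folder
)
print(f"Weights downloaded to {local_path}")
```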

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
[here](minigpt4/configs/models/minigpt4_vicuna0.yaml#L18) at Line 18
and/or the path to the Llama 2 weights in the model config file
[here](minigpt4/configs/models/minigpt4_llama2.yaml#L15) at Line 15.
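
As an optional sanity check (an illustrative snippet, not part of the repository), you can verify that the path you set actually points at a Hugging Face-format checkpoint:

```python
# Illustrative sanity check: confirm the llama_model path in the model config
# points at a folder containing a Hugging Face checkpoint (e.g. config.json).
from pathlib import Path
import yaml  # pip install pyyaml

cfg = yaml.safe_load(Path("minigpt4/configs/models/minigpt4_llama2.yaml").read_text())
weight_dir = Path(cfg["model"]["llama_model"])
assert (weight_dir / "config.json").is_file(), f"No HF checkpoint found in {weight_dir}"
print(f"Found LLM weights in {weight_dir}")
```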

**3. Prepare the pretrained MiniGPT-4 checkpoint**

Download the pretrained MiniGPT-4 checkpoint that matches the LLM you prepared.

| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
[Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing)
| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B | Checkpoint Aligned with Llama 2 Chat 7B |
:------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
[Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) | [Download](https://drive.google.com/file/d/11nAPjEok8eAGGEG1N2vXo3kBLCg0WgUk/view?usp=sharing)


Then, set the path to the pretrained checkpoint in the evaluation config file
in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 11.
in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 8 for the Vicuna version, or in [eval_configs/minigpt4_llama2_eval.yaml](eval_configs/minigpt4_llama2_eval.yaml#L10) for the Llama 2 version.



### Launching Demo Locally

Try out our demo [demo.py](demo.py) on your local machine by running
Try out our demo [demo.py](demo.py) for the Vicuna version on your local machine by running

```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```

To save GPU memory, Vicuna loads as 8 bit by default, with a beam search width of 1.
This configuration requires about 23G GPU memory for Vicuna 13B and 11.5G GPU memory for Vicuna 7B.
or for the Llama 2 version by running

```
python demo.py --cfg-path eval_configs/minigpt4_llama2_eval.yaml --gpu-id 0
```


To save GPU memory, the LLM loads in 8-bit by default, with a beam search width of 1.
This configuration requires about 23G of GPU memory for the 13B LLM and 11.5G for the 7B LLM.
For more powerful GPUs, you can run the model
in 16-bit by setting low_resource to False in the config file
[minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml) and using a larger beam search width. A rough sketch of the 8-bit loading path is shown below.
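
For intuition, the following is a minimal sketch of the kind of 8-bit loading that the `low_resource` flag typically corresponds to in a Hugging Face `transformers` setup; MiniGPT-4's internal wiring may differ, and the path below is a placeholder.

```python
# Illustrative only: roughly what low_resource: True amounts to when the LLM is
# loaded through Hugging Face transformers (requires bitsandbytes for 8-bit).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llm_path = "/path/to/llm/weights"  # placeholder: your Vicuna or Llama 2 folder

tokenizer = AutoTokenizer.from_pretrained(llm_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    llm_path,
    load_in_8bit=True,     # 8-bit weights: roughly half the memory of fp16
    device_map={"": 0},    # keep the whole model on GPU 0
)

# low_resource: False would instead load fp16 weights directly on the GPU:
# model = AutoModelForCausalLM.from_pretrained(
#     llm_path, torch_dtype=torch.float16).to("cuda:0")
```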
12 changes: 11 additions & 1 deletion demo.py
@@ -10,7 +10,7 @@
from minigpt4.common.config import Config
from minigpt4.common.dist_utils import get_rank
from minigpt4.common.registry import registry
from minigpt4.conversation.conversation import Chat, CONV_VISION
from minigpt4.conversation.conversation import Chat, CONV_VISION_Vicuna0, CONV_VISION_LLama2

# imports modules for registration
from minigpt4.datasets.builders import *
@@ -50,6 +50,9 @@ def setup_seeds(config):
# Model Initialization
# ========================================

conv_dict = {'pretrain_vicuna0': CONV_VISION_Vicuna0,
'pretrain_llama2': CONV_VISION_LLama2}

print('Initializing Chat')
args = parse_args()
cfg = Config(args)
@@ -59,22 +62,27 @@ def setup_seeds(config):
model_cls = registry.get_model_class(model_config.arch)
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))

CONV_VISION = conv_dict[model_config.model_type]

vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id))
print('Initialization Finished')


# ========================================
# Gradio Setting
# ========================================


def gradio_reset(chat_state, img_list):
if chat_state is not None:
chat_state.messages = []
if img_list is not None:
img_list = []
return None, gr.update(value=None, interactive=True), gr.update(placeholder='Please upload your image first', interactive=False),gr.update(value="Upload & Start Chat", interactive=True), chat_state, img_list


def upload_img(gr_img, text_input, chat_state):
if gr_img is None:
return None, None, gr.update(interactive=True), chat_state, None
@@ -83,6 +91,7 @@ def upload_img(gr_img, text_input, chat_state):
llm_message = chat.upload_img(gr_img, chat_state, img_list)
return gr.update(interactive=False), gr.update(interactive=True, placeholder='Type and press Enter'), gr.update(value="Start Chatting", interactive=False), chat_state, img_list


def gradio_ask(user_message, chatbot, chat_state):
if len(user_message) == 0:
return gr.update(interactive=True, placeholder='Input should not be empty!'), chatbot, chat_state
@@ -101,6 +110,7 @@ def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
chatbot[-1][1] = llm_message
return chatbot, chat_state, img_list


title = """<h1 align="center">Demo of MiniGPT-4</h1>"""
description = """<h3>This is the demo of MiniGPT-4. Upload your images and start chatting!</h3>"""
article = """<p><a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a></p><p><a href='https://github.com/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/Github-Code-blue'></a></p><p><a href='https://raw.githubusercontent.com/Vision-CAIR/MiniGPT-4/main/MiniGPT_4.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a></p>
7 changes: 2 additions & 5 deletions eval_configs/minigpt4_eval.yaml
@@ -1,14 +1,11 @@
model:
arch: mini_gpt4
model_type: pretrain_vicuna
freeze_vit: True
freeze_qformer: True
model_type: pretrain_vicuna0
max_txt_len: 160
end_sym: "###"
low_resource: True
prompt_path: "prompts/alignment.txt"
prompt_template: '###Human: {} ###Assistant: '
ckpt: '/path/to/pretrained/ckpt/'
ckpt: '/home/zhud/ibex/pretrained_minigpt4.pth'


datasets:
22 changes: 22 additions & 0 deletions eval_configs/minigpt4_llama2_eval.yaml
@@ -0,0 +1,22 @@
model:
arch: mini_gpt4
model_type: pretrain_llama2
max_txt_len: 160
end_sym: "</s>"
low_resource: True
prompt_template: '[INST] {} [/INST] '
ckpt: '/home/zhud/c2090/zhud/project/MiniGPT-4/minigpt4/output/minigpt4_stage2_finetune/20230826182/checkpoint_4.pth'


datasets:
cc_sbu_align:
vis_processor:
train:
name: "blip2_image_eval"
image_size: 224
text_processor:
train:
name: "blip_caption"

run:
task: image_text_pretrain
5 changes: 4 additions & 1 deletion minigpt4/common/dist_utils.py
@@ -55,7 +55,10 @@ def is_main_process():


def init_distributed_mode(args):
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
if args.distributed is False:
print("Not using distributed mode")
return
elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
args.rank = int(os.environ["RANK"])
args.world_size = int(os.environ["WORLD_SIZE"])
args.gpu = int(os.environ["LOCAL_RANK"])
29 changes: 29 additions & 0 deletions minigpt4/configs/models/minigpt4_llama2.yaml
@@ -0,0 +1,29 @@
model:
arch: mini_gpt4

# vit encoder
image_size: 224
drop_path_rate: 0
use_grad_checkpoint: False
vit_precision: "fp16"
freeze_vit: True
has_qformer: False

# generation configs
prompt: ""

llama_model: "/path/to/llama2/weight"

preprocess:
vis_processor:
train:
name: "blip2_image_train"
image_size: 224
eval:
name: "blip2_image_eval"
image_size: 224
text_processor:
train:
name: "blip_caption"
eval:
name: "blip_caption"
minigpt4/configs/models/minigpt4_vicuna0.yaml
@@ -12,12 +12,11 @@ model:
# Q-Former
num_query_token: 32

# Vicuna
llama_model: "/path/to/vicuna/weights/"

# generation configs
prompt: ""

llama_model: "/path/to/vicuna/weight"

preprocess:
vis_processor:
train:
22 changes: 16 additions & 6 deletions minigpt4/conversation/conversation.py
@@ -39,18 +39,18 @@ def get_prompt(self):
ret = self.system + self.sep
for role, message in self.messages:
if message:
ret += role + ": " + message + self.sep
ret += role + message + self.sep
else:
ret += role + ":"
ret += role
return ret
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(self.messages):
if message:
ret += role + ": " + message + seps[i % 2]
ret += role + message + seps[i % 2]
else:
ret += role + ":"
ret += role
return ret
else:
raise ValueError(f"Invalid style: {self.sep_style}")
@@ -106,16 +106,26 @@ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
return False


CONV_VISION = Conversation(
CONV_VISION_Vicuna0 = Conversation(
system="Give the following image: <Img>ImageContent</Img>. "
"You will be able to see the image once I provide it to you. Please answer my questions.",
roles=("Human", "Assistant"),
roles=("Human: ", "Assistant: "),
messages=[],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)

CONV_VISION_LLama2 = Conversation(
system="Give the following image: <Img>ImageContent</Img>. "
"You will be able to see the image once I provide it to you. Please answer my questions.",
roles=("<s>[INST] ", " [/INST] "),
messages=[],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="",
)
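
# Illustration only (not part of this commit): with the SINGLE-style get_prompt()
# above now concatenating role + message directly, the two templates render roughly as
#
#   CONV_VISION_Vicuna0 -> "<system>###Human: <image + question>###Assistant: "
#   CONV_VISION_LLama2  -> "<system><s>[INST] <image + question> [/INST] "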



class Chat:
4 changes: 2 additions & 2 deletions minigpt4/datasets/datasets/cc_sbu_dataset.py
@@ -22,7 +22,7 @@ def __init__(self, vis_processor, text_processor, location):
def to_dict(self, sample):
return {
"image": sample[0],
"text_input": self.text_processor(sample[1]["caption"]),
"answer": self.text_processor(sample[1]["caption"]),
}


@@ -42,6 +42,6 @@ def __getitem__(self, index):

return {
"image": image,
"text_input": caption,
"answer": caption,
"image_id": self.img_ids[ann["image_id"]],
}
2 changes: 1 addition & 1 deletion minigpt4/datasets/datasets/laion_dataset.py
@@ -26,6 +26,6 @@ def __init__(self, vis_processor, text_processor, location):
def to_dict(self, sample):
return {
"image": sample[0],
"text_input": self.text_processor(sample[1]["caption"]),
"answer": self.text_processor(sample[1]["caption"]),
}

