Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models


Haoning Wu¹^*, Zicheng Zhang²^*, Erli Zhang¹^*, Chaofeng Chen¹, Liang Liao¹, Annan Wang¹, Kaixin Xu⁴,

Chunyi Li², Jingwen Hou¹, Guangtao Zhai², Geng Xue⁴, Wenxiu Sun³, Qiong Yan³, Weisi Lin¹^#

¹Nanyang Technological University, ²Shanghai Jiaotong University, ³Sensetime Research, ⁴I2R@A*STAR

^*Equal contribution. ^#Corresponding author.

Dataset | Model Zoo | Paper (Preview) | Demo (Huggingface)

Quick Start

If your server has a poor connection to Huggingface (for example, for users in mainland China), we also provide a way to download the weights from ModelScope. Click through to see the guide.

LLaVA-v1.5

Install LLaVA.

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

Simple Interactive Demos.

See the code and scripts below.

Example Code (Single Query)

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
model_path = "teowu/llava_v1.5_7b_qinstruct_preview_v0.1" 
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
})()
eval_model(args)
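
To query several images with the same prompt, the args object above can be built in a loop. The following is a minimal sketch, not part of the repository: `build_args` and `image_files` are hypothetical names, and the model name is derived by taking the last path component as an assumed stand-in for `get_model_name_from_path`, so that the sketch runs without LLaVA installed.

```python
from types import SimpleNamespace

MODEL_PATH = "teowu/llava_v1.5_7b_qinstruct_preview_v0.1"
PROMPT = "Rate the quality of the image. Think step by step."


def build_args(image_file, model_path=MODEL_PATH, prompt=PROMPT):
    """Return an args object with the same fields as the single-query example."""
    return SimpleNamespace(
        model_path=model_path,
        model_base=None,
        # Assumption: stand-in for get_model_name_from_path(model_path).
        model_name=model_path.split("/")[-1],
        query=prompt,
        conv_mode=None,
        image_file=image_file,
        sep=",",
    )


# Hypothetical image list; only fig/sausage.jpg appears in this README.
image_files = ["fig/sausage.jpg"]
batch = [build_args(path) for path in image_files]

for args in batch:
    # With LLaVA installed, this is where eval_model(args) would be called.
    print(args.image_file, "->", args.query)
```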

Example Code (CLI Demo for Multi-turn Conversation)

python -m llava.serve.cli \
    --model-path teowu/llava_v1.5_7b_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: Results may vary between runs because do_sample=True is enabled in conversation mode.
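
As a toy illustration of why sampling gives varying answers, the sketch below contrasts greedy selection with sampling over a made-up next-token distribution. It uses only the Python standard library and is not LLaVA's decoding code; the tokens, probabilities, and function names are all hypothetical.

```python
import random

# Made-up next-token distribution, for illustration only.
TOKENS = ["good", "fair", "poor"]
PROBS = [0.5, 0.3, 0.2]


def greedy(tokens, probs):
    """Deterministic: always pick the most probable token."""
    return max(zip(probs, tokens))[1]


def sample(tokens, probs, rng):
    """Stochastic: draw one token according to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


def sample_sequence(seed=None, n=5):
    """Draw n tokens; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)
    return [sample(TOKENS, PROBS, rng) for _ in range(n)]


print("greedy:  ", [greedy(TOKENS, PROBS) for _ in range(5)])
print("sampled: ", sample_sequence())        # may differ between runs
print("seeded:  ", sample_sequence(seed=0))  # same list on every run
```

Greedy selection corresponds to decoding without sampling, while the unseeded draw behaves like the conversation mode described in the note.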

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/llava_v1.5/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/llava_v1.5/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/llava_v1.5/eval_video_quality.py

mPLUG-Owl-2

For mPLUG-Owl-2, only single-GPU inference is currently supported. Please set the environment variable CUDA_VISIBLE_DEVICES (e.g. export CUDA_VISIBLE_DEVICES=0) so that the model is loaded on only one device.
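
When the model is launched from a Python script rather than a shell, the same restriction can be applied from inside the process. This is a minimal sketch under the assumption that it runs before CUDA is initialized (in practice, before importing torch); `restrict_to_single_gpu` is a hypothetical helper, not part of the repository.

```python
import os


def restrict_to_single_gpu(device_index=0):
    """Expose only one GPU to this process.

    The variable is read when CUDA is initialized, so call this before
    importing torch or any library that touches the GPU.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_single_gpu(0)
print(f"CUDA_VISIBLE_DEVICES={visible}")
# Imports of torch / mplug_owl2 would follow here.
```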

Install mPLUG-Owl-2.

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2/
pip install -e .

Simple Interactive Demos

Example Code (Single Query)

from mplug_owl2.mm_utils import get_model_name_from_path
from eval_scripts.mplug_owl_2.run_mplug_owl2 import eval_model
model_path = "teowu/mplug_owl2_7b_448_qinstruct_preview_v0.1" 
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
})()
eval_model(args)

Example Code (CLI Demo for Multi-turn Conversation)

python -m mplug_owl2.serve.cli \
    --model-path teowu/mplug_owl2_7b_448_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: Results may vary between runs because do_sample=True is enabled in conversation mode.

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/mplug_owl_2/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/mplug_owl_2/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/mplug_owl_2/eval_video_quality.py

InternLM-XComposer-VL

InternLM-XComposer-VL has been integrated into Huggingface AutoModel (remote code mode). You can directly start with the code below without a separate install process.

Simple Interactive Demos

Example Code (Single Query)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Single-Turn Text-Image Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response = model.generate(text, image)
print(response)

Example Code (Multi-Turn Conversation)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Multi-Turn Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response, history = model.chat(text, image, history=None)
print(f'User: {text}')
print(f'Bot: {response}')

text = 'Which part of the pan is clearer, the top part or the bottom part?'
response, history = model.chat(text=text, image=None, history=history)
print(f'User: {text}')
print(f'Bot: {response}')
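
The two-turn pattern above generalizes to any number of follow-up questions by threading `history` through a loop. The sketch below assumes only the `chat(text=..., image=..., history=...)` call shape used in the example; `run_dialogue` and `EchoModel` are hypothetical names, and `EchoModel` is a stand-in so that the sketch runs without downloading weights.

```python
def run_dialogue(model, image, questions):
    """Ask several questions about one image, carrying history between turns."""
    history = None
    transcript = []
    for turn, text in enumerate(questions):
        # As in the example above, the image is passed on the first turn only.
        response, history = model.chat(
            text=text,
            image=image if turn == 0 else None,
            history=history,
        )
        transcript.append((text, response))
    return transcript


class EchoModel:
    """Hypothetical stand-in exposing the same chat() shape as the real model."""

    def chat(self, text, image, history):
        history = (history or []) + [text]
        return f"turn {len(history)}: {text}", history


questions = [
    "Describe and evaluate the quality of the image.",
    "Which part of the pan is clearer, the top part or the bottom part?",
]
for user, bot in run_dialogue(EchoModel(), "fig/sausage.jpg", questions):
    print(f"User: {user}")
    print(f"Bot: {bot}")
```

Replacing `EchoModel()` with the loaded `model` from the example above should give the same interaction, provided its `chat` method returns a `(response, history)` pair as shown there.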

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench.

python eval_scripts/internlm_xcomposer_vl/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_video_quality.py
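
The evaluation commands for the three models follow one naming pattern, so they can be enumerated programmatically. This is a minimal sketch that only builds and prints the command lines listed in this README; `eval_commands` is a hypothetical helper, and actually running the commands assumes the repository root as working directory and each model's environment already installed.

```python
MODEL_DIRS = ["llava_v1.5", "mplug_owl_2", "internlm_xcomposer_vl"]
SCRIPTS = ["eval_qbench_mcq.py", "eval_image_quality.py", "eval_video_quality.py"]


def eval_commands(model_dirs=MODEL_DIRS, scripts=SCRIPTS, python="python"):
    """Return one argv list per (model, evaluation script) pair."""
    return [
        [python, f"eval_scripts/{model_dir}/{script}"]
        for model_dir in model_dirs
        for script in scripts
    ]


for cmd in eval_commands():
    print(" ".join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```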

Model Zoo

See Model Zoo. Both huggingface and modelscope weights are provided.

Training

At present, we only provide training scripts for LLaVA-v1.5 (7B/13B). Please see Training Docs for more details.

License

Researchers and open-source developers are free to use the Q-Instruct dataset and the fine-tuned weights as provided for the four MLLMs. Commercial use is also allowed, but requires prior permission from our team. Please email [email protected] to request permission for commercial use.

