[IQA, Low-level Vision, MLLM] Low-level visual instruction tuning, with a 200K dataset and a model zoo for fine-tuned checkpoints.

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

1Nanyang Technological University, 2Shanghai Jiaotong University, 3Sensetime Research, 4I2R@A*STAR
*Equal contribution. #Corresponding author.

Quick Start

If your server has a poor connection to Hugging Face (for example, for users in mainland China), we provide an alternative way to download the weights from ModelScope. Click through to the guide for details.

LLaVA-v1.5

Install LLaVA.

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

Simple Interactive Demos.

See the code and scripts below.

Example Code (Single Query)

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "teowu/llava_v1.5_7b_qinstruct_preview_v0.1"
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"

# eval_model expects an argparse-style object, so build a lightweight one.
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    # Recent LLaVA releases also read the generation settings below;
    # the values follow the example in LLaVA's own README.
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()
eval_model(args)

Example Code (CLI Demo for Multi-turn Conversation)

python -m llava.serve.cli \
    --model-path teowu/llava_v1.5_7b_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: The results may contain randomness as do_sample=True is enabled during conversation mode.
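The note above can be illustrated with a toy example. The sketch below is not LLaVA's decoding code: it uses a hypothetical three-token vocabulary with made-up scores to show why sampling (as with do_sample=True) can give different outputs from run to run, while greedy decoding always returns the same token.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, logits):
    """Greedy decoding: always take the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]

def sample_pick(tokens, logits, rng):
    """Sampling: draw one token in proportion to its probability."""
    return rng.choices(tokens, weights=softmax(logits), k=1)[0]

# Hypothetical vocabulary and scores, for illustration only.
tokens = ["good", "fair", "poor"]
logits = [2.0, 1.5, 0.5]

greedy = [greedy_pick(tokens, logits) for _ in range(5)]
sampled = [sample_pick(tokens, logits, random.Random(seed)) for seed in range(5)]
print(greedy)   # the same token every time
print(sampled)  # can vary from seed to seed
```

Reusing the same seed reproduces the same sampled token, which is why fixing seeds (or disabling sampling) is the usual way to get repeatable outputs.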

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench:

python eval_scripts/llava_v1.5/eval_qbench_mcq.py
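As a rough illustration of what an MCQ evaluation measures, the sketch below computes accuracy from free-form model responses. It assumes that each response starts with the chosen option letter; the helper names `extract_choice` and `mcq_accuracy` and the sample data are hypothetical and are not taken from eval_qbench_mcq.py.

```python
import re

def extract_choice(response):
    """Return the leading option letter (A-D) of a response, or None.

    Naive on purpose: a response such as "A blurry image" would also
    be read as option A, so real scripts may need stricter parsing.
    """
    match = re.match(r"\(?([A-D])\)?(?:[.:)\s]|$)", response.strip())
    return match.group(1) if match else None

def mcq_accuracy(responses, answers):
    """Fraction of responses whose extracted letter equals the answer."""
    if not answers or len(responses) != len(answers):
        raise ValueError("responses and answers must be non-empty and equal in length")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)

# Made-up responses and ground-truth letters, for illustration only.
responses = ["B. The image is blurry.", "(C)", "I am not sure.", "A"]
answers = ["B", "C", "D", "B"]
print(mcq_accuracy(responses, answers))  # 0.5
```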
Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/llava_v1.5/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/llava_v1.5/eval_video_quality.py
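Quality assessment results are commonly summarized by how well predicted scores correlate with human opinion scores, using Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC). Whether the scripts above report exactly these metrics is an assumption here; the sketch below only shows how the two correlations are computed, using made-up scores and hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up predicted scores and mean opinion scores, for illustration only.
predicted = [0.9, 0.4, 0.7, 0.1]
opinion = [4.5, 2.0, 3.9, 1.2]
print(f"SRCC={spearman(predicted, opinion):.3f} PLCC={pearson(predicted, opinion):.3f}")
```

SRCC depends only on the ordering of the scores, so it is unaffected by the scale of the model's output; PLCC also rewards a linear relationship with the human scores.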

mPLUG-Owl-2

For mPLUG-Owl-2, only single-GPU inference is currently supported. Set the CUDA_VISIBLE_DEVICES environment variable (e.g. export CUDA_VISIBLE_DEVICES=0) so that the model is loaded on a single device.
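If exporting the variable in the shell is inconvenient, the same restriction can be applied from inside a Python script. This is a minimal sketch, assuming GPU index 0 is the device you want; the variable has to be set before CUDA is initialized, and the simplest way to guarantee that is to set it before importing torch.

```python
import os

# Expose only GPU 0 to this process. Set this before CUDA is initialized
# (in practice: before importing torch or any model code).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```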

Install mPLUG-Owl-2.

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2/
pip install -e .

Simple Interactive Demos

Example Code (Single Query)

from mplug_owl2.mm_utils import get_model_name_from_path
from eval_scripts.mplug_owl_2.run_mplug_owl2 import eval_model

model_path = "teowu/mplug_owl2_7b_448_qinstruct_preview_v0.1"
prompt = "Rate the quality of the image. Think step by step."
image_file = "fig/sausage.jpg"

# eval_model expects an argparse-style object, so build a lightweight one.
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
})()
eval_model(args)

Example Code (CLI Demo for Multi-turn Conversation)

python -m mplug_owl2.serve.cli \
    --model-path teowu/mplug_owl2_7b_448_qinstruct_preview_v0.1 \
    --image-file "fig/sausage.jpg"

Note: The results may contain randomness as do_sample=True is enabled during conversation mode.

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench:

python eval_scripts/mplug_owl_2/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/mplug_owl_2/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/mplug_owl_2/eval_video_quality.py

InternLM-XComposer-VL

InternLM-XComposer-VL has been integrated into Huggingface AutoModel (remote code mode). You can directly start with the code below without a separate install process.

Simple Interactive Demos

Example Code (Single Query)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Single-Turn Text-Image Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response = model.generate(text, image)
print(response)

Example Code (Multi-Turn Conversation)

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('DLight1551/internlm-xcomposer-vl-7b-qinstruct-full', trust_remote_code=True)
model.tokenizer = tokenizer

# Multi-Turn Dialogue
text = 'Describe and evaluate the quality of the image.'
image = 'fig/sausage.jpg'
response, history = model.chat(text, image, history=None)
print(f'User: {text}')
print(f'Bot: {response}')

text = 'Which part of the pan is clearer, the top part or the bottom part?'
response, history = model.chat(text=text, image=None, history=history)
print(f'User: {text}')
print(f'Bot: {response}')
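The two-turn pattern above generalizes to any number of follow-up questions. The sketch below is one way to write that loop, assuming a chat callable that takes the same text, image, and history keyword arguments as model.chat in the example; `run_dialogue` is a hypothetical helper, and `fake_chat` is a stand-in so the sketch runs without downloading weights.

```python
def run_dialogue(chat_fn, image, questions):
    """Ask the questions in order, passing the image only on the first turn."""
    history = None
    transcript = []
    for turn, text in enumerate(questions):
        response, history = chat_fn(
            text=text,
            image=image if turn == 0 else None,
            history=history,
        )
        transcript.append((text, response))
    return transcript

def fake_chat(text, image, history):
    """Stand-in for model.chat: records each turn and returns a numbered reply."""
    history = (history or []) + [text]
    return f"reply {len(history)}", history

questions = [
    "Describe and evaluate the quality of the image.",
    "Which part of the pan is clearer, the top part or the bottom part?",
]
for question, answer in run_dialogue(fake_chat, "fig/sausage.jpg", questions):
    print(f"User: {question}")
    print(f"Bot: {answer}")
```

With the real model loaded, passing model.chat in place of fake_chat should behave like the example above, provided its signature matches.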

Quantitative Evaluations

Multi-choice question (MCQ) in Q-Bench:

python eval_scripts/internlm_xcomposer_vl/eval_qbench_mcq.py

Image/Video Quality Assessment

Image Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_image_quality.py

Video Quality Assessment:

python eval_scripts/internlm_xcomposer_vl/eval_video_quality.py

Model Zoo

See Model Zoo. Weights are provided on both Hugging Face and ModelScope.

Training

At present, we only provide training scripts for LLaVA-v1.5 (7B/13B). Please see Training Docs for more details.

License

Researchers and open-source developers are free to use the Q-Instruct dataset and the fine-tuned weights as provided for the four MLLMs. Commercial use is also allowed, but it requires prior permission from our team. Please email [email protected] to request permission for commercial use.
