
Vision-Language Models (VLMs) Testing Resources

📒Introduction

Vision-Language Models (VLMs) Testing Resources: a curated list of awesome VLM testing papers with code. Check 📖Contents for more details. This repo is updated frequently ~ 👨‍💻 Welcome to star ⭐️ it or submit a PR, which I will review and merge!

📖Contents

📖Review

A Survey on Benchmarks of Multimodal Large Language Models.
J Li, W Lu.
ArXiv, 2024. [ArXiv] [Github]

A Survey on Evaluation of Multimodal Large Language Models.
J Huang, J Zhang.
arxiv:2408.15769, 2024. [ArXiv]

From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities.
C Lu, C Qian, G Zheng, H Fan, H Gao, J Zhang, J Shao, J Deng, J Fu, K Huang, K Li, L Li, et al.
ArXiv, 2024. [ArXiv] [Github]

📖General

Comprehensive

[Lvlm-ehub] Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.
P Xu, W Shao, K Zhang, P Gao, S Liu, M Lei, F Meng, S Huang, Y Qiao, P Luo.
arXiv:2306.09265, 2023. [ArXiv]

[Mmbench] Mmbench: Is your multi-modal model an all-around player?
Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao, Y Yuan, J Wang, C He, Z Liu, K Chen, D Lin.
arXiv:2307.06281, 2023. [ArXiv] [Github]

[MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
C Fu, P Chen, Y Shen, Y Qin, M Zhang, X Lin, J Yang, X Zheng, K Li, X Sun, et al.
arXiv:2306.13394, 2023.

[Mm-vet] Mm-vet: Evaluating large multimodal models for integrated capabilities.
W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu, X Wang, L Wang.
arXiv:2308.02490, 2023. [ArXiv] [Github]

[OwlEval] mplug-owl: Modularization empowers large language models with multimodality.
Q Ye, H Xu, G Xu, J Ye, M Yan, Y Zhou, J Wang, A Hu, P Shi, Y Shi, C Li, Y Xu, H Chen, et al.
arXiv:2304.14178, 2023. [ArXiv]

[Seed-bench] Seed-bench: Benchmarking multimodal llms with generative comprehension.
B Li, R Wang, G Wang, Y Ge, Y Ge, Y Shan.
arXiv:2307.16125, 2023. [ArXiv] [Github]

[Touchstone] Touchstone: Evaluating vision-language models by language models.
S Bai, S Yang, J Bai, P Wang, X Zhang, J Lin, X Wang, C Zhou, J Zhou.
arXiv:2308.16890, 2023. [ArXiv] [Github]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.
M Ning, B Zhu, Y Xie, B Lin, J Cui, L Yuan, D Chen, L Yuan.
arxiv:2311.16103, 2023. [ArXiv] [Github]

Towards an Exhaustive Evaluation of Vision-Language Foundation Models.
E Salin, S Ayache, B Favre.
ICCV, 2023. [Paper]

[HR-Bench] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models.
W Wang, L Ding, M Zeng, X Zhou, L Shen, Y Luo, D Tao.
ArXiv, 2024. [ArXiv] [Github]

[Blink] Blink: Multimodal large language models can see but not perceive.
X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth, NA Smith, WC Ma, R Krishna.
arxiv:2404.12390, 2024. [ArXiv] [Github]

[MME-RealWorld] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
YF Zhang, H Zhang, H Tian, C Fu, S Zhang, J Wu, F Li, K Wang, Q Wen, Z Zhang, L Wang, et al.
arXiv:2408.13257, 2024. [ArXiv]

[Mmt-bench] Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi.
K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang, et al.
arXiv, 2024. [ArXiv] [Github] [HuggingFace]

[MuirBench] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding.
F Wang, X Fu, JY Huang, Z Li, Q Liu, X Liu, MD Ma, N Xu, W Zhou, K Zhang, TL Yan, WJ Mo, et al.
arXiv:2406.09411, 2024. [ArXiv] [Github] [HuggingFace]

[MMMU] Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
X Yue, Y Ni, K Zhang, T Zheng, R Liu, G Zhang, S Stevens, D Jiang, W Ren, Y Sun, C Wei, et al.
CVPR, 2024. [Paper] [Github]

[Vbench] Vbench: Comprehensive benchmark suite for video generative models.
Z Huang, Y He, J Yu, F Zhang, C Si, Y Jiang, Y Zhang, T Wu, Q Jin, N Chanpaisit, Y Wang, et al.
CVPR, 2024. [ArXiv] [Github]

Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning.
M Shukor, A Rame, C Dancette, M Cord.
ICLR, 2024. [ArXiv] [Github]

Understanding

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|------|------|-------|-------|----------|--------|----------|
| 2023 | Content | [MM-BigBench] MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks. | [ArXiv] | - | [Github] | - |
| 2024 | Dialog | [MMDU] MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLM. | [ArXiv] | - | [Github] | - |
| 2024 | Relation | [CRPE] The all-seeing project v2: Towards general relation comprehension of the open world. | [ArXiv] | - | [Github] | [HuggingFace] |
| 2023 | Image | [Journeydb] Journeydb: A benchmark for generative image understanding. | [NeurIPS] | - | [Github] | - |
| 2024 | Image | [MMIU] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models. | [ArXiv] | - | [Github] | - |
| 2024 | Image | [MMLongBench-Doc] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations. | [ArXiv] | - | [Github] | - |
| 2024 | Video | [ET Bench] ET Bench: Towards Open-Ended Event-Level Video-Language Understanding. | [ArXiv] | - | [Github] | - |
| 2024 | Video | [MVBench] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. | [CVPR] | - | [Github] | - |
| 2024 | Video | [VideoVista] VideoVista: A Versatile Benchmark for Video Understanding and Reasoning. | [ArXiv] | - | [Github] | - |
| 2024 | Video | [MLVU] MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding. | [ArXiv] | - | [Github] | - |

Generation

Text-to-Image

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|------|------|-------|-------|----------|--------|----------|
| 2023 | Text-to-Image | Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation. | [ACMMM] | - | - | - |
| 2023 | Text-to-Image | Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. | [ArXiv] | - | [Github] | - |
| 2023 | Text-to-Image | Pku-i2iqa: An image-to-image quality assessment database for ai generated images. | [ArXiv] | - | [Github] | - |
| 2023 | Text-to-Image | Toward verifiable and reproducible human evaluation for text-to-image generation. | [CVPR] | - | - | - |
| 2023 | Text-to-Image | Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. | [ICCV] | - | [Github] | - |
| 2023 | Text-to-Image | T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. | [NeurIPS] | - | [Github] | - |
| 2023 | Text-to-Image | Agiqa-3k: An open database for ai-generated image quality assessment. | [TCSVT] | - | [Github] | - |
| 2024 | Text-to-Image | Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Image | Evaluating text-to-visual generation with image-to-text generation. | [ArXiv] | - | - | - |
| 2024 | Text-to-Image | UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark. | [ArXiv] | - | - | - |
| 2024 | Text-to-Image | Aigiqa-20k: A large database for ai-generated image quality assessment. | [CVPRW] | - | - | [DataSets] |
| 2024 | Text-to-Image | Holistic evaluation of text-to-image models. | [NeurIPS] | - | [Github] | - |
| 2024 | Text-to-Image | Imagereward: Learning and evaluating human preferences for text-to-image generation. | [NeurIPS] | - | - | - |
| 2024 | Text-to-Image | Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation. | [NeurIPS] | - | [Github] | - |
| 2024 | Text-to-Image | EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Image | PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Image | PTlTScore: Towards Long-Tail Effects in Text-to-Visual Evaluation with Generative Foundation Models. | [CVPR] | - | - | - |
| 2024 | Text-to-Image | FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models. | [CVPR] | - | [Github] | - |

Text-to-Video

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|------|------|-------|-------|----------|--------|----------|
| 2023 | Text-to-Video | Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. | [NeurIPS] | - | [Github] | - |
| 2024 | Text-to-Video | AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | Evaluation of Text-to-Video Generation Models: A Dynamics Perspective. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | GAIA: Rethinking Action Quality Assessment for AI-Generated Videos. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | VideoPhy: Evaluating Physical Commonsense for Video Generation. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation. | [ArXiv] | - | [Github] | - |
| 2024 | Text-to-Video | AIGC-VQA: A Holistic Perception Metric for AIGC Video Quality Assessment. | [CVPR] | - | - | - |
| 2024 | Text-to-Video | Evalcrafter: Benchmarking and evaluating large video generation models. | [CVPR] | [Homepage] | [Github] | - |
| 2024 | Text-to-Video | T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation. | [CVPR] | - | [Github] | - |
| 2024 | Text-to-Video | Benchmarking AIGC Video Quality Assessment: A Dataset and Unified Model. | [ArXiv] | - | - | - |

VQA

Vqa: Visual question answering.
S Antol, A Agrawal, J Lu, M Mitchell, et al.
ICCV, 2015. [Paper] [Homepage]
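
For reference, the VQA benchmark above scores a predicted answer against ten human annotations: an answer earns min(#matching humans / 3, 1) credit, so agreeing with at least three annotators counts as fully correct. A minimal sketch of that rule (answer normalization and the official averaging over annotator subsets are omitted):

```python
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """VQA accuracy for one question: min(#matching human answers / 3, 1).

    The official evaluation also normalizes answers (lowercasing, stripping
    articles/punctuation) and averages over subsets of nine annotators;
    both steps are omitted in this sketch.
    """
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: four of ten annotators answered "2", so the prediction gets full credit.
print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "3", "two", "3", "3"]))  # 1.0
```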

Making the v in vqa matter: Elevating the role of image understanding in visual question answering.
Y Goyal, T Khot, D Summers-Stay, D Batra, D Parikh, et al.
CVPR, 2017. [Paper] [Homepage]

[Ok-vqa] Ok-vqa: A visual question answering benchmark requiring external knowledge.
K Marino, M Rastegari, A Farhadi, R Mottaghi.
CVPR, 2019. [Paper]

[TextVQA] Towards VQA Models That Can Read.
A Singh, V Natarajan, M Shah, et al.
CVPR, 2019. [Paper] [Homepage]

[DocVQA] Docvqa: A dataset for vqa on document images.
M Mathew, D Karatzas, CV Jawahar.
WACV, 2021. [Paper] [Homepage]

[ChartQA] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning.
A Masry, DX Long, JQ Tan, S Joty, E Hoque.
arXiv:2203.10244, 2022. [Paper]

[ScienceQA] Learn to explain: Multimodal reasoning via thought chains for science question answering.
P Lu, S Mishra, T Xia, L Qiu, KW Chang, SC Zhu, O Tafjord, P Clark, A Kalyan.
Advances in Neural Information Processing Systems, 2022. [NeurIPS] [Github]

KNVQA: A Benchmark for evaluation knowledge-based VQA.
S Cheng, S Zhang, J Wu, M Lan.
arXiv:2311.12639, 2023.

Maqa: A multimodal qa benchmark for negation.
JY Li, A Jansen, Q Huang, J Lee, R Ganti, D Kuzmin.
arXiv:2301.03238, 2023. [ArXiv]

Multimodal multi-hop question answering through a conversation between tools and efficiently finetuned large language models.
H Rajabzadeh, S Wang, HJ Kwon, B Liu.
arXiv:2309.08922, 2023. [ArXiv]

Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.
S Li, N Tajbakhsh.
arXiv:2308.03349, 2023. [ArXiv]

Slidevqa: A dataset for document visual question answering on multiple images.
R Tanaka, K Nishida, K Nishida, T Hasegawa, I Saito, K Saito.
AAAI, 2023. [ArXiv] [Github]

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning.
Z He, X Wu, P Zhou, R Xuan, G Liu, X Yang, Q Zhu, H Huang.
arXiv:2401.14011, 2024. [ArXiv] [Github]

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains.
Y Kim, M Yim, KY Song.
arXiv:2404.19205, 2024. [ArXiv] [Github]

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering.
J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li, MFFB Mahmood, H Feng, Z Zhao, Y Wang, et al.
arXiv:2405.11985, 2024. [ArXiv] [Github]

Reasoning

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models.
X Han, Q You, Y Liu, W Chen, H Zheng, K Mrini, et al.
arXiv:2311.11567, 2023. [ArXiv]

Measuring and improving chain-of-thought reasoning in vision-language models.
Y Chen, K Sikka, M Cogswell, H Ji, A Divakaran.
arXiv:2309.04461, 2023. [ArXiv] [Github]

Compbench: A comparative reasoning benchmark for multimodal llms.
J Kil, Z Mai, J Lee, Z Wang, K Cheng, L Wang, Y Liu, A Chowdhury, WL Chao.
arXiv:2407.16837, 2024. [ArXiv] [Github]

Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs.
MU Khattak, MF Naeem, J Hassan, M Naseer, et al.
arXiv, 2024. [ArXiv] [Github]

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models.
R Wadhawan, H Bansal, KW Chang, N Peng.
arXiv:2401.13311, 2024. [ArXiv] [Github]

Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning.
Y Wang, W Chen, X Han, X Lin, H Zhao, Y Liu, B Zhai, J Yuan, Q You, H Yang.
arXiv:2401.06805, 2024. [ArXiv]

Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences.
X Wang, Y Zhou, X Liu, H Lu, Y Xu, F He, J Yoon, T Lu, G Bertasius, M Bansal, H Yao, et al.
arXiv:2401.10529, 2024. [ArXiv] [Github]

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models.
L Fan, W Hua, X Li, K Zhu, M Jin, L Li, H Ling, J Chi, J Wang, X Ma, Y Zhang.
arXiv:2403.01777, 2024. [ArXiv] [Github]

Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models.
P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi, et al.
ICLR, 2024. [Homepage] [Github]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.
X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong, Y Sun, M Yin, B Yu, G Zhang, H Sun, Y Su, et al.
arxiv:2409.02813, 2024. [ArXiv] [Github]

[MATH-V] Measuring multimodal mathematical reasoning with math-vision dataset.
K Wang, J Pan, W Shi, Z Lu, M Zhan, H Li.
arXiv:2402.14804, 2024. [ArXiv] [Github]

[MMMU] Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
X Yue, Y Ni, K Zhang, T Zheng, R Liu, G Zhang, S Stevens, D Jiang, W Ren, Y Sun, C Wei, et al.
CVPR, 2024. [Paper] [Github]

[Mathverse] Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?
R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu, A Zhou, P Lu, KW Chang, P Gao, H Li.
arXiv:2403.14624, 2024. [ArXiv] [Github]

Multilingual

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark.
D Romero, C Lyu, HA Wibowo, T Lynn, I Hamed, AN Kishore, A Mandal, A Dragonetti, et al.
arXiv:2406.05967, 2024. [Paper] [DataSets]

[MMMB] Parrot: Multilingual Visual Instruction Tuning.
HL Sun, DW Zhou, Y Li, S Lu, C Yi, QG Chen, Z Xu, W Luo, K Zhang, DC Zhan, HJ Ye.
arXiv:2406.02539, 2024. [Paper] [Github]

Instruction-Following

[Visit-bench] Visit-bench: A benchmark for vision-language instruction following inspired by real-world use.
Y Bitton, H Bansal, J Hessel, R Shao, W Zhu, A Awadalla, J Gardner, R Taori, L Schmidt.
arXiv:2308.06595, 2023. [ArXiv] [Github]

Visual instruction tuning.
H Liu, C Li, Q Wu, YJ Lee.
NeurIPS, 2024. [Paper] [Homepage]

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs.
Y Qian, H Ye, JP Fauconnier, P Grasch, Y Yang, et al.
arXiv, 2024. [Paper] [Homepage]

High-Level-Vision

OCR

[OCRBench] On the hidden mystery of ocr in large multimodal models.
Y Liu, Z Li, H Li, W Yu, M Huang, D Peng, M Liu, M Chen, C Li, L Jin, X Bai.
arXiv:2305.07895, 2023. [ArXiv] [Github]

Aesthetics

[Aesbench] Aesbench: An expert benchmark for multimodal large language models on image aesthetics perception.
Y Huang, Q Yuan, X Sheng, Z Yang, H Wu, P Chen, Y Yang, L Li, W Lin.
arXiv:2401.08276, 2024. [ArXiv] [Github]

[A-Bench] A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Z Zhang, H Wu, C Li, Y Zhou, W Sun, X Min, Z Chen, X Liu, W Lin, G Zhai.
arXiv:2406.03070, 2024. [ArXiv] [Github]

Low-Level-Vision

Q-bench: A benchmark for general-purpose foundation models on low-level vision.
H Wu, Z Zhang, E Zhang, C Chen, L Liao, A Wang, C Li, W Sun, Q Yan, G Zhai, W Lin.
arXiv:2309.14181, 2023. [ArXiv] [Github]

A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs.
Z Zhang, H Wu, E Zhang, G Zhai, W Lin.
arXiv:2402.07116, 2024. [ArXiv] [Github]

Reliable

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|------|------|-------|-------|----------|--------|----------|
| 2023 | Hallucination | An llm-free multi-dimensional benchmark for mllms hallucination evaluation. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | [POPE] Evaluating object hallucination in large vision-language models. | [ArXiv] | - | - | - |
| 2024 | Hallucination | LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models. | [ArXiv] | - | - | - |
| 2024 | Hallucination | [Hallusionbench] Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. | [CVPR] | - | [Github] | - |
| 2024 | Hallucination | Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models. | [ArXiv] | - | [Github] | - |
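
The [POPE] entry above probes object hallucination with binary questions ("Is there a <object> in the image?") and reports accuracy, precision, recall, F1, and the ratio of "yes" answers. A minimal sketch of that scoring, assuming model replies have already been mapped to "yes"/"no" strings:

```python
from typing import Dict, List

def pope_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy / precision / recall / F1 / yes-ratio over parallel lists of
    "yes"/"no" answers, with "yes" as the positive class (as in POPE)."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(predictions, labels))
    fp = sum(p == "yes" and g == "no" for p, g in zip(predictions, labels))
    fn = sum(p == "no" and g == "yes" for p, g in zip(predictions, labels))
    tn = sum(p == "no" and g == "no" for p, g in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": predictions.count("yes") / len(predictions),
    }

# Example with two correct answers and one hallucinated "yes".
print(pope_metrics(["yes", "no", "yes"], ["yes", "no", "no"]))
```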

Robust

Fool your (vision and) language model with embarrassingly simple permutations.
Y Zong, T Yu, et al.
arXiv, 2024. [ArXiv] [Github]

Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions.
Y Liu, Z Liang, Y Wang, M He, J Li, B Zhao.
arXiv:2406.10638, 2024. [ArXiv] [Github]

Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models.
S Chen, J Gu, Z Han, Y Ma, P Torr, V Tresp.
Advances in Neural Information Processing Systems, 2024. [NeurIPS] [Github]

Application

Search

MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines.
D Jiang, R Zhang, Z Guo, Y Wu, J Lei, P Qiu, P Lu, Z Chen, G Song, P Gao, Y Liu, C Li, H Li.
arXiv, 2024. [ArXiv] [Github]

Agent

Crab: Cross-environment agent benchmark for multimodal language model agents.
T Xu, L Chen, DJ Wu, Y Chen, Z Zhang, X Yao, Z Xie, Y Chen, S Liu, B Qian, P Torr, et al.
arXiv, 2024. [ArXiv] [Github]

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents.
L Wang, Y Deng, Y Zha, G Mao, Q Wang, T Min, W Chen, S Chen.
arXiv:2406.08184, 2024. [ArXiv] [Github]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.
T Xie, D Zhang, J Chen, X Li, S Zhao, R Cao, TJ Hua, Z Cheng, D Shin, F Lei, Y Liu, Y Xu, et al.
arXiv:2404.07972, 2024. [ArXiv] [Github]

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents.
X Liu, T Zhang, et al.
arXiv, 2024. [ArXiv] [Github]

Security

How many unicorns are in this image? a safety evaluation benchmark for vision llms.
H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han, W Zhou, H Yao, C Xie.
arXiv:2311.16101, 2023. [ArXiv] [Github]

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models.
Y Miao, Y Zhu, Y Dong, L Yu, J Zhu, XS Gao.
arxiv:2407.05965, 2024. [ArXiv]

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models.
T Gu, Z Zhou, K Huang, D Liang, Y Wang, H Zhao, Y Yao, X Qiao, K Wang, Y Yang, Y Teng, et al.
arxiv:2406.07594, 2024. [ArXiv]

Industry

Medical

Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai.
P Chen, J Ye, G Wang, Y Li, Z Deng, W Li, T Li, H Duan, Z Huang, Y Su, B Wang, S Zhang, et al.
ArXiv, 2024. [ArXiv] [Huggingface]

Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm.
Y Hu, T Li, Q Lu, W Shao, J He, Y Qiao, P Luo.
CVPR, 2024. [CVPR] [Github]

Human-Machine-Interaction

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences.
Y Lu, D Jiang, W Chen, WY Wang, Y Choi, BY Lin.
arXiv:2406.11069, 2024. [ArXiv] [DataSets]

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models.
Y Wu, W Yu, Y Cheng, Y Wang, X Zhang, J Xu, M Ding, Y Dong.
arXiv:2406.09295, 2024. [ArXiv] [Github]

Mmtom-qa: Multimodal theory of mind question answering.
C Jin, Y Wu, J Cao, J Xiang, YL Kuo, Z Hu, T Ullman, A Torralba, JB Tenenbaum, T Shu.
arXiv:2401.08743, 2024. [ArXiv] [Github]

Omni-Modal

Pano-avqa: Grounded audio-visual question answering on 360° videos.
H Yun, Y Yu, W Yang, K Lee, G Kim.
ICCV, 2021. [ICCV] [Github]

Learning to answer questions in dynamic audio-visual scenarios.
G Li, Y Wei, Y Tian, C Xu, JR Wen, et al.
CVPR, 2022. [CVPR] [Github]

Avqa: A dataset for audio-visual question answering on videos.
P Yang, X Wang, X Duan, H Chen, R Hou, C Jin, W Zhu.
Proceedings of the 30th ACM international conference on multimedia, 2022. [MM] [Github]

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues.
Q Ye, Z Yu, X Liu.
arXiv:2403.06679, 2024. [ArXiv]

Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset.
R Liu, H Zuo, Z Lian, X Xing, BW Schuller, H Li.
arXiv:2407.02751, 2024. [ArXiv]

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition.
S Deng, EE Kosloski, S Patel, ZA Barnett, Y Nan, A Kaplan, S Aarukapalli, WT Doan, et al.
arXiv:2406.02554, 2024. [ArXiv]

Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering.
J Ma, M Hu, P Wang, W Sun, L Song, H Pei, J Liu, Y Du.
arXiv:2404.12020, 2024. [ArXiv]

Merbench: A unified evaluation benchmark for multimodal emotion recognition.
Z Lian, L Sun, Y Ren, H Gu, H Sun, L Chen, B Liu, J Tao.
arXiv:2401.03429, 2024. [ArXiv]

OmniBench: Towards The Future of Universal Omni-Language Models.
Y Li, G Zhang, et al.
ArXiv, 2024. [ArXiv]

OmniXR: Evaluating Omni-modality Language Models on Reasoning across Modalities.
L Chen, H Hu, M Zhang, Y Chen, Z Wang, Y Li, P Shyam, T Zhou, H Huang, MH Yang, et al.
arXiv:2410.12219, 2024. [ArXiv]

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset.
J Liu, S Chen, X He, L Guo, X Zhu, W Wang, J Tang.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. [ArXiv]

Testing-Methods

Task Me Anything.
J Zhang, W Huang, Z Ma, O Michel, D He, et al.
ArXiv, 2024. [ArXiv] [HomePage]

A lightweight generalizable evaluation and enhancement framework for generative models and generated samples.
G Zhao, V Magoulianitis, S You, CCJ Kuo.
WACV, 2024. [ArXiv]

Quality Evaluation

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria.
W Ge, S Chen, et al.
ArXiv, 2023. [ArXiv] [HomePage]

Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark.
D Chen, R Chen, S Zhang, Y Liu, Y Wang, H Zhou, Q Zhang, P Zhou, Y Wan, L Sun.
arXiv:2402.04788, 2024. [ArXiv] [HomePage]

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Z Chen, Y Du, Z Wen, Y Zhou, C Cui, Z Weng, et al.
ArXiv, 2024. [ArXiv] [HomePage]
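
A recurring pattern in these works is to prompt a strong multimodal model to grade another model's output against per-sample criteria or a fixed rubric. A minimal, hypothetical sketch of such a judging loop; the `judge` callable below is a placeholder for whatever multimodal chat API is used, not the interface of any cited paper:

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are grading a vision-language model's answer to a question about the attached image.\n"
    "Question: {question}\n"
    "Model answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent) and reply with the number only."
)

def judge_answer(judge: Callable[[str, bytes], str],
                 image: bytes, question: str, answer: str) -> int:
    """Score one (image, question, answer) triple with an MLLM judge.

    `judge(prompt, image)` is a hypothetical stand-in for a multimodal chat
    API call that returns the judge model's text reply.
    """
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer), image)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # 0 marks an unparseable reply
```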

Testing-Tools

GenAI Arena: An Open Evaluation Platform for Generative Models.
D Jiang, M Ku, T Li, Y Ni, S Sun, R Fan, W Chen.
arxiv:2406.04485, 2024. [ArXiv] [HomePage]

VLMEvalKit
Shanghai AI Lab
[Github]

lmms-eval
LMMs-Lab
[HomePage] [Github]

Multi-Modality-Arena
OpenGVLab
[Github]
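
Toolkits such as VLMEvalKit, lmms-eval, and Multi-Modality-Arena all wrap the same basic loop: load a benchmark, query a model per sample, score the prediction, and aggregate. A minimal, hypothetical sketch of that loop (the `model`, `dataset`, and `score` interfaces below are placeholders, not the APIs of these toolkits):

```python
from typing import Callable, Iterable, Tuple

Sample = Tuple[bytes, str, str]  # (image bytes, question, reference answer)

def evaluate(model: Callable[[bytes, str], str],
             dataset: Iterable[Sample],
             score: Callable[[str, str], float]) -> float:
    """Run a VLM over a benchmark and return the mean per-sample score."""
    total, n = 0.0, 0
    for image, question, reference in dataset:
        prediction = model(image, question)    # query the VLM once per sample
        total += score(prediction, reference)  # e.g. exact match or VQA accuracy
        n += 1
    return total / n if n else 0.0

# Usage sketch: exact-match scoring over a one-item toy dataset.
if __name__ == "__main__":
    toy = [(b"", "What color is the sky?", "blue")]
    echo_model = lambda image, question: "blue"
    print(evaluate(echo_model, toy, lambda p, r: float(p.strip().lower() == r)))
```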

Challenges

[MMStar] Are We on the Right Way for Evaluating Large Vision-Language Models?
L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen, H Duan, J Wang, Y Qiao, D Lin, F Zhao, et al.
ArXiv, 2024. [ArXiv] [Github]

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases.
AMH Tiong, J Zhao, B Li, J Li, SCH Hoi, et al.
ArXiv, 2024. [ArXiv] [Github]
