Vision Language Models (VLMs) Testing Resources: a curated list of awesome VLM testing papers with code; check 📖Contents for more details. This repo is updated frequently ~ 👨‍💻 Welcome to star ⭐️ or submit a PR to this repo, and I will review and merge it!
- 📖Review
- 📖General
- 📖Security
- 📖Application
- 📖Industry
- 📖Human-Machine-Interaction
- 📖Omni-Modal
- 📖Testing-Methods
- 📖Testing-Tools
- 📖Challenges
A Survey on Benchmarks of Multimodal Large Language Models.
J Li, W Lu.
ArXiv, 2024.
[ArXiv]
[Github]
A Survey on Evaluation of Multimodal Large Language Models.
J Huang, J Zhang.
arXiv:2408.15769, 2024.
[ArXiv]
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities.
C Lu, C Qian, G Zheng, H Fan, H Gao, J Zhang, J Shao, J Deng, J Fu, K Huang, K Li, L Li, et al.
ArXiv, 2024.
[ArXiv]
[Github]
[Lvlm-ehub] Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models.
P Xu, W Shao, K Zhang, P Gao, S Liu, M Lei, F Meng, S Huang, Y Qiao, P Luo.
arXiv:2306.09265, 2023.
[ArXiv]
[Mmbench] Mmbench: Is your multi-modal model an all-around player?
Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao, Y Yuan, J Wang, C He, Z Liu, K Chen, D Lin.
arXiv:2307.06281, 2023.
[ArXiv]
[Github]
[MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
C Fu, P Chen, Y Shen, Y Qin, M Zhang, X Lin, J Yang, X Zheng, K Li, X Sun, et al.
arXiv:2306.13394, 2023.
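MME reports, for each subtask, a per-question accuracy over its yes/no questions plus a stricter per-image "accuracy+" that credits an image only when both of its two questions are answered correctly; the subtask score is the sum of the two percentages. A minimal sketch of that scoring rule, using a hypothetical record format:

```python
from collections import defaultdict

def mme_subtask_score(records):
    """records: (image_id, predicted_yes_no, gold_yes_no) tuples for one subtask,
    two questions per image. Returns (accuracy, accuracy_plus, score) in percent."""
    per_question = [pred == gold for _, pred, gold in records]
    accuracy = 100.0 * sum(per_question) / len(per_question)

    per_image = defaultdict(list)
    for image_id, pred, gold in records:
        per_image[image_id].append(pred == gold)
    # accuracy+: an image counts only if both of its questions are correct
    accuracy_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)

    return accuracy, accuracy_plus, accuracy + accuracy_plus

# Hypothetical predictions for two images:
records = [("img1", "yes", "yes"), ("img1", "no", "no"),
           ("img2", "yes", "no"), ("img2", "no", "no")]
print(mme_subtask_score(records))  # (75.0, 50.0, 125.0)
```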
[Mm-vet] Mm-vet: Evaluating large multimodal models for integrated capabilities.
W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu, X Wang, L Wang.
arXiv:2308.02490, 2023.
[ArXiv]
[Github]
[OwlEval] mplug-owl: Modularization empowers large language models with multimodality.
Q Ye, H Xu, G Xu, J Ye, M Yan, Y Zhou, J Wang, A Hu, P Shi, Y Shi, C Li, Y Xu, H Chen, et al.
arXiv:2304.14178, 2023.
[ArXiv]
[Seed-bench] Seed-bench: Benchmarking multimodal llms with generative comprehension.
B Li, R Wang, G Wang, Y Ge, Y Ge, Y Shan.
arXiv:2307.16125, 2023.
[ArXiv]
[Github]
[Touchstone] Touchstone: Evaluating vision-language models by language models.
S Bai, S Yang, J Bai, P Wang, X Zhang, J Lin, X Wang, C Zhou, J Zhou.
arXiv:2308.16890, 2023.
[ArXiv]
[Github]
Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.
M Ning, B Zhu, Y Xie, B Lin, J Cui, L Yuan, D Chen, L Yuan.
arXiv:2311.16103, 2023.
[ArXiv]
[Github]
Towards an Exhaustive Evaluation of Vision-Language Foundation Models.
E Salin, S Ayache, B Favre.
ICCV, 2023.
[Paper]
[HR-Bench] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models.
W Wang, L Ding, M Zeng, X Zhou, L Shen, Y Luo, D Tao.
ArXiv, 2024.
[ArXiv]
[Github]
[Blink] Blink: Multimodal large language models can see but not perceive.
X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth, NA Smith, WC Ma, R Krishna.
arXiv:2404.12390, 2024.
[ArXiv]
[Github]
[MME-RealWorld] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
YF Zhang, H Zhang, H Tian, C Fu, S Zhang, J Wu, F Li, K Wang, Q Wen, Z Zhang, L Wang, et al.
arXiv:2408.13257, 2024.
[ArXiv]
[Mmt-bench] Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi.
K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang, et al.
arXiv, 2024.
[ArXiv]
[Github]
[HuggingFace]
[MuirBench] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding.
F Wang, X Fu, JY Huang, Z Li, Q Liu, X Liu, MD Ma, N Xu, W Zhou, K Zhang, TL Yan, WJ Mo, et al.
arXiv:2406.09411, 2024.
[ArXiv]
[Github]
[HuggingFace]
[MMMU] Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
X Yue, Y Ni, K Zhang, T Zheng, R Liu, G Zhang, S Stevens, D Jiang, W Ren, Y Sun, C Wei, et al.
CVPR, 2024.
[Paper]
[Github]
[Vbench] Vbench: Comprehensive benchmark suite for video generative models.
Z Huang, Y He, J Yu, F Zhang, C Si, Y Jiang, Y Zhang, T Wu, Q Jin, N Chanpaisit, Y Wang, et al.
CVPR, 2024.
[ArXiv]
[Github]
Beyond task performance: Evaluating and reducing the flaws of large multimodal models with in-context learning.
M Shukor, A Rame, C Dancette, M Cord.
ICLR, 2024.
[ArXiv]
[Github]
Date | Task | Title | Paper | HomePage | Github | DataSets |
---|---|---|---|---|---|---|
2023 | Content | [MM-BigBench] MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks. | [ArXiv] | - | [Github] | - |
2024 | Dialog | [MMDU] MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLM. | [ArXiv] | - | [Github] | - |
2024 | Relation | [CRPE] The all-seeing project v2: Towards general relation comprehension of the open world. | [ArXiv] | - | [Github] | [HuggingFace] |
2023 | Image | [Journeydb] Journeydb: A benchmark for generative image understanding. | [NeurIPS] | - | [Github] | - |
2024 | Image | [MMIU] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models. | [ArXiv] | - | [Github] | - |
2024 | Image | [MMLongBench-Doc] MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations. | [ArXiv] | - | [Github] | - |
2024 | Video | [ET Bench] ET Bench: Towards Open-Ended Event-Level Video-Language Understanding. | [ArXiv] | - | [Github] | - |
2024 | Video | [MVBench] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. | [CVPR] | - | [Github] | - |
2024 | Video | [VideoVista] VideoVista: A Versatile Benchmark for Video Understanding and Reasoning. | [ArXiv] | - | [Github] | - |
2024 | Video | [MLVU] MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding. | [ArXiv] | - | [Github] | - |
Date | Task | Title | Paper | HomePage | Github | DataSets |
---|---|---|---|---|---|---|
2023 | Text-to-Image | Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation. | [ACMMM] | - | - | - |
2023 | Text-to-Image | Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. | [ArXiv] | - | [Github] | - |
2023 | Text-to-Image | Pku-i2iqa: An image-to-image quality assessment database for ai generated images. | [ArXiv] | - | [Github] | - |
2023 | Text-to-Image | Toward verifiable and reproducible human evaluation for text-to-image generation. | [CVPR] | - | - | - |
2023 | Text-to-Image | Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. | [ICCV] | - | [Github] | - |
2023 | Text-to-Image | T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. | [NeurIPS] | - | [Github] | - |
2023 | Text-to-Image | Agiqa-3k: An open database for ai-generated image quality assessment. | [TCSVT] | - | [Github] | - |
2024 | Text-to-Image | Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Image | Evaluating text-to-visual generation with image-to-text generation. | [ArXiv] | - | - | - |
2024 | Text-to-Image | UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark. | [ArXiv] | - | - | - |
2024 | Text-to-Image | Aigiqa-20k: A large database for ai-generated image quality assessment. | [CVPRW] | - | - | [DataSets] |
2024 | Text-to-Image | Holistic evaluation of text-to-image models. | [NeurIPS] | - | [Github] | - |
2024 | Text-to-Image | Imagereward: Learning and evaluating human preferences for text-to-image generation. | [NeurIPS] | - | - | - |
2024 | Text-to-Image | Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation. | [NeurIPS] | - | [Github] | - |
2024 | Text-to-Image | EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Image | PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Image | PTlTScore: Towards Long-Tail Effects in Text-to-Visual Evaluation with Generative Foundation Models. | [CVPR] | - | - | - |
2024 | Text-to-Image | FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models. | [CVPR] | - | [Github] | - |
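Several of the automatic text-to-image evaluation methods above build on embedding similarity between the prompt and the generated image. A minimal, illustrative sketch of a CLIP-based alignment score with Hugging Face `transformers` (the checkpoint and the 2.5 rescaling, which follows the common CLIPScore convention, are assumptions here, not the recipe of any specific benchmark listed):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, rescaled to [0, 2.5]."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# score = clip_alignment_score(Image.open("generated.png"), "a red bicycle on a beach")
```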
Date | Task | Title | Paper | HomePage | Github | DataSets |
---|---|---|---|---|---|---|
2023 | Text-to-Video | Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. | [NeurIPS] | - | [Github] | - |
2024 | Text-to-Video | AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | Evaluation of Text-to-Video Generation Models: A Dynamics Perspective. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | GAIA: Rethinking Action Quality Assessment for AI-Generated Videos. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | VideoPhy: Evaluating Physical Commonsense for Video Generation. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation. | [ArXiv] | - | [Github] | - |
2024 | Text-to-Video | AIGC-VQA: A Holistic Perception Metric for AIGC Video Quality Assessment. | [CVPR] | - | - | - |
2024 | Text-to-Video | Evalcrafter: Benchmarking and evaluating large video generation models. | [CVPR] | [Homepage] | [Github] | - |
2024 | Text-to-Video | T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation. | [CVPR] | - | [Github] | - |
2024 | Text-to-Video | Benchmarking AIGC Video Quality Assessment: A Dataset and Unified Model. | [ArXiv] | - | - | - |
Vqa: Visual question answering.
S Antol, A Agrawal, J Lu, M Mitchell, et al.
ICCV, 2015.
[Paper]
[Homepage]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering.
Y Goyal, T Khot, D Summers-Stay, D Batra, D Parikh, et al.
CVPR, 2017.
[Paper]
[Homepage]
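The two VQA benchmarks above score open-ended answers against ten human annotations with a consensus rule: an answer receives min(#matching annotators / 3, 1) credit. A minimal sketch of that rule (the official implementation additionally normalizes answer strings and averages over all 9-annotator subsets, which is omitted here):

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy used by the VQA benchmarks: full credit if at least
    3 of the 10 annotators gave the predicted answer."""
    pred = predicted.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> credit of 2/3
print(vqa_accuracy("blue", ["blue", "blue", "navy"] + ["dark blue"] * 7))
```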
[Ok-vqa] Ok-vqa: A visual question answering benchmark requiring external knowledge.
K Marino, M Rastegari, A Farhadi, R Mottaghi.
CVPR, 2019.
[Paper]
[TextVQA] Towards VQA Models That Can Read.
A Singh, V Natarajan, M Shah, et al.
CVPR, 2019.
[Paper]
[Homepage]
[DocVQA] Docvqa: A dataset for vqa on document images.
M Mathew, D Karatzas, CV Jawahar.
WACV, 2021.
[Paper]
[Homepage]
[ChartQA] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning.
A Masry, DX Long, JQ Tan, S Joty, E Hoque.
arXiv:2203.10244, 2022.
[Paper]
[ScienceQA] Learn to explain: Multimodal reasoning via thought chains for science question answering.
P Lu, S Mishra, T Xia, L Qiu, KW Chang, SC Zhu, O Tafjord, P Clark, A Kalyan.
NeurIPS, 2022.
[NeurIPS]
[Github]
KNVQA: A Benchmark for evaluation knowledge-based VQA.
S Cheng, S Zhang, J Wu, M Lan.
arXiv:2311.12639, 2023.
Maqa: A multimodal qa benchmark for negation.
JY Li, A Jansen, Q Huang, J Lee, R Ganti, D Kuzmin.
arXiv:2301.03238, 2023.
[ArXiv]
Multimodal multi-hop question answering through a conversation between tools and efficiently finetuned large language models.
H Rajabzadeh, S Wang, HJ Kwon, B Liu.
arXiv:2309.08922, 2023.
[ArXiv]
Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.
S Li, N Tajbakhsh.
arXiv:2308.03349, 2023.
[ArXiv]
Slidevqa: A dataset for document visual question answering on multiple images.
R Tanaka, K Nishida, K Nishida, T Hasegawa, I Saito, K Saito.
AAAI, 2023.
[ArXiv]
[Github]
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning.
Z He, X Wu, P Zhou, R Xuan, G Liu, X Yang, Q Zhu, H Huang.
arXiv:2401.14011, 2024.
[ArXiv]
[Github]
TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains.
Y Kim, M Yim, KY Song.
arXiv:2404.19205, 2024.
[ArXiv]
[Github]
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering.
J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li, MFFB Mahmood, H Feng, Z Zhao, Y Wang, et al.
arXiv:2405.11985, 2024.
[ArXiv]
[Github]
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models.
X Han, Q You, Y Liu, W Chen, H Zheng, K Mrini, et al.
arXiv:2311.11567, 2023.
[ArXiv]
Measuring and improving chain-of-thought reasoning in vision-language models.
Y Chen, K Sikka, M Cogswell, H Ji, A Divakaran.
arXiv:2309.04461, 2023.
[ArXiv]
[Github]
Compbench: A comparative reasoning benchmark for multimodal llms.
J Kil, Z Mai, J Lee, Z Wang, K Cheng, L Wang, Y Liu, A Chowdhury, WL Chao.
arXiv:2407.16837, 2024.
[ArXiv]
[Github]
Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs.
MU Khattak, MF Naeem, J Hassan, M Naseer, et al.
arXiv, 2024.
[ArXiv]
[Github]
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models.
R Wadhawan, H Bansal, KW Chang, N Peng.
arXiv:2401.13311, 2024.
[ArXiv]
[Github]
Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning.
Y Wang, W Chen, X Han, X Lin, H Zhao, Y Liu, B Zhai, J Yuan, Q You, H Yang.
arXiv:2401.06805, 2024.
[ArXiv]
Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences.
X Wang, Y Zhou, X Liu, H Lu, Y Xu, F He, J Yoon, T Lu, G Bertasius, M Bansal, H Yao, et al.
arXiv:2401.10529, 2024.
[ArXiv]
[Github]
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models.
L Fan, W Hua, X Li, K Zhu, M Jin, L Li, H Ling, J Chi, J Wang, X Ma, Y Zhang.
arXiv:2403.01777, 2024.
[ArXiv]
[Github]
Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models.
P Lu, H Bansal, T Xia, J Liu, C Li, H Hajishirzi, et al.
ICLR, 2024.
[Homepage]
[Github]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.
X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong, Y Sun, M Yin, B Yu, G Zhang, H Sun, Y Su, et al.
arXiv:2409.02813, 2024.
[ArXiv]
[Github]
[MATH-V] Measuring multimodal mathematical reasoning with math-vision dataset.
K Wang, J Pan, W Shi, Z Lu, M Zhan, H Li.
arXiv:2402.14804, 2024.
[ArXiv]
[Github]
[MMMU] Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
X Yue, Y Ni, K Zhang, T Zheng, R Liu, G Zhang, S Stevens, D Jiang, W Ren, Y Sun, C Wei, et al.
CVPR, 2024.
[Paper]
[Github]
[Mathverse] Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?
R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu, A Zhou, P Lu, KW Chang, P Gao, H Li.
arXiv:2403.14624, 2024.
[ArXiv]
[Github]
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark.
D Romero, C Lyu, HA Wibowo, T Lynn, I Hamed, AN Kishore, A Mandal, A Dragonetti, et al.
arXiv:2406.05967, 2024.
[Paper]
[DataSets]
[MMMB] Parrot: Multilingual Visual Instruction Tuning.
HL Sun, DW Zhou, Y Li, S Lu, C Yi, QG Chen, Z Xu, W Luo, K Zhang, DC Zhan, HJ Ye.
arXiv:2406.02539, 2024.
[Paper]
[Github]
[Visit-bench] Visit-bench: A benchmark for vision-language instruction following inspired by real-world use.
Y Bitton, H Bansal, J Hessel, R Shao, W Zhu, A Awadalla, J Gardner, R Taori, L Schmidt.
arXiv:2308.06595, 2023.
[ArXiv]
[Github]
Visual instruction tuning.
H Liu, C Li, Q Wu, YJ Lee.
NeurIPS, 2024.
[Paper]
[Homepage]
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs.
Y Qian, H Ye, JP Fauconnier, P Grasch, Y Yang, et al.
arXiv, 2024.
[Paper]
[Homepage]
[OCRBench] On the hidden mystery of ocr in large multimodal models.
Y Liu, Z Li, H Li, W Yu, M Huang, D Peng, M Liu, M Chen, C Li, L Jin, X Bai.
arXiv:2305.07895, 2023.
[ArXiv]
[Github]
[Aesbench] Aesbench: An expert benchmark for multimodal large language models on image aesthetics perception.
Y Huang, Q Yuan, X Sheng, Z Yang, H Wu, P Chen, Y Yang, L Li, W Lin.
arXiv:2401.08276, 2024.
[ArXiv]
[Github]
[A-Bench] A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Z Zhang, H Wu, C Li, Y Zhou, W Sun, X Min, Z Chen, X Liu, W Lin, G Zhai.
arXiv:2406.03070, 2024.
[ArXiv]
[Github]
Q-bench: A benchmark for general-purpose foundation models on low-level vision.
H Wu, Z Zhang, E Zhang, C Chen, L Liao, A Wang, C Li, W Sun, Q Yan, G Zhai, W Lin.
arXiv:2309.14181, 2023.
[ArXiv]
[Github]
A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs.
Z Zhang, H Wu, E Zhang, G Zhai, W Lin.
arXiv:2402.07116, 2024.
[ArXiv]
[Github]
Date | Task | Title | Paper | HomePage | Github | DataSets |
---|---|---|---|---|---|---|
2023 | Hallucination | An llm-free multi-dimensional benchmark for mllms hallucination evaluation. | [ArXiv] | - | [Github] | - |
2024 | Hallucination | [POPE] Evaluating object hallucination in large vision-language models. | [ArXiv] | - | - | - |
2024 | Hallucination | LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models. | [ArXiv] | - | - | - |
2024 | Hallucination | [Hallusionbench] Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v(ision), llava-1.5, and other multi-modality models. | [CVPR] | - | [Github] | - |
2024 | Hallucination | Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models. | [ArXiv] | - | [Github] | - |
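POPE-style object-hallucination evaluation (listed above) asks binary yes/no questions about object presence and reports accuracy, precision, recall, F1, and the ratio of "yes" answers as a hallucination-tendency indicator. A minimal sketch, assuming model outputs have already been mapped to "yes"/"no":

```python
def pope_metrics(preds: list[str], golds: list[str]) -> dict:
    """Binary yes/no metrics in the style of POPE hallucination evaluation."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, golds))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, golds))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, golds))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(preds),  # tendency to answer "yes"
    }
```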
Fool your (vision and) language model with embarrassingly simple permutations.
Y Zong, T Yu, et al.
arXiv, 2024.
[ArXiv]
[Github]
Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions.
Y Liu, Z Liang, Y Wang, M He, J Li, B Zhao.
arXiv:2406.10638, 2024.
[ArXiv]
[Github]
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models.
S Chen, J Gu, Z Han, Y Ma, P Torr, V Tresp.
NeurIPS, 2024.
[NeurIPS]
[Github]
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines.
D Jiang, R Zhang, Z Guo, Y Wu, J Lei, P Qiu, P Lu, Z Chen, G Song, P Gao, Y Liu, C Li, H Li.
arXiv, 2024.
[ArXiv]
[Github]
Crab: Cross-environment agent benchmark for multimodal language model agents.
T Xu, L Chen, DJ Wu, Y Chen, Z Zhang, X Yao, Z Xie, Y Chen, S Liu, B Qian, P Torr, et al.
arXiv, 2024.
[ArXiv]
[Github]
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents.
L Wang, Y Deng, Y Zha, G Mao, Q Wang, T Min, W Chen, S Chen.
arXiv:2406.08184, 2024.
[ArXiv]
[Github]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.
T Xie, D Zhang, J Chen, X Li, S Zhao, R Cao, TJ Hua, Z Cheng, D Shin, F Lei, Y Liu, Y Xu, et al.
arXiv:2404.07972, 2024.
[ArXiv]
[Github]
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents.
X Liu, T Zhang, et al.
arXiv, 2024.
[ArXiv]
[Github]
How many unicorns are in this image? a safety evaluation benchmark for vision llms.
H Tu, C Cui, Z Wang, Y Zhou, B Zhao, J Han, W Zhou, H Yao, C Xie.
arXiv:2311.16101, 2023.
[ArXiv]
[Github]
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models.
Y Miao, Y Zhu, Y Dong, L Yu, J Zhu, XS Gao.
arXiv:2407.05965, 2024.
[ArXiv]
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models.
T Gu, Z Zhou, K Huang, D Liang, Y Wang, H Zhao, Y Yao, X Qiao, K Wang, Y Yang, Y Teng, et al.
arXiv:2406.07594, 2024.
[ArXiv]
Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai.
P Chen, J Ye, G Wang, Y Li, Z Deng, W Li, T Li, H Duan, Z Huang, Y Su, B Wang, S Zhang, et al.
ArXiv, 2024.
[ArXiv]
[Huggingface]
Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm.
Y Hu, T Li, Q Lu, W Shao, J He, Y Qiao, P Luo.
CVPR, 2024.
[CVPR]
[Github]
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences.
Y Lu, D Jiang, W Chen, WY Wang, Y Choi, BY Lin.
arXiv:2406.11069, 2024.
[ArXiv]
[DataSets]
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models.
Y Wu, W Yu, Y Cheng, Y Wang, X Zhang, J Xu, M Ding, Y Dong.
arXiv:2406.09295, 2024.
[ArXiv]
[Github]
Mmtom-qa: Multimodal theory of mind question answering.
C Jin, Y Wu, J Cao, J Xiang, YL Kuo, Z Hu, T Ullman, A Torralba, JB Tenenbaum, T Shu.
arXiv:2401.08743, 2024.
[ArXiv]
[Github]
Pano-avqa: Grounded audio-visual question answering on 360° videos.
H Yun, Y Yu, W Yang, K Lee, G Kim.
ICCV, 2021.
[ICCV]
[Github]
Learning to answer questions in dynamic audio-visual scenarios.
G Li, Y Wei, Y Tian, C Xu, JR Wen, et al.
CVPR, 2022.
[CVPR]
[Github]
Avqa: A dataset for audio-visual question answering on videos.
P Yang, X Wang, X Duan, H Chen, R Hou, C Jin, W Zhu.
ACM MM, 2022.
[MM]
[Github]
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues.
Q Ye, Z Yu, X Liu.
arXiv:2403.06679, 2024.
[ArXiv]
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset.
R Liu, H Zuo, Z Lian, X Xing, BW Schuller, H Li.
arXiv:2407.02751, 2024.
[ArXiv]
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition.
S Deng, EE Kosloski, S Patel, ZA Barnett, Y Nan, A Kaplan, S Aarukapalli, WT Doan, et al.
arXiv:2406.02554, 2024.
[ArXiv]
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering.
J Ma, M Hu, P Wang, W Sun, L Song, H Pei, J Liu, Y Du.
arXiv:2404.12020, 2024.
[ArXiv]
Merbench: A unified evaluation benchmark for multimodal emotion recognition.
Z Lian, L Sun, Y Ren, H Gu, H Sun, L Chen, B Liu, J Tao.
arXiv:2401.03429, 2024.
[ArXiv]
OmniBench: Towards The Future of Universal Omni-Language Models.
Y Li, G Zhang, et al.
ArXiv, 2024.
[ArXiv]
OmniXR: Evaluating Omni-modality Language Models on Reasoning across Modalities.
L Chen, H Hu, M Zhang, Y Chen, Z Wang, Y Li, P Shyam, T Zhou, H Huang, MH Yang, et al.
arXiv:2410.12219, 2024.
[ArXiv]
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset.
J Liu, S Chen, X He, L Guo, X Zhu, W Wang, J Tang.
IEEE TPAMI, 2024.
[ArXiv]
Task Me Anything.
J Zhang, W Huang, Z Ma, O Michel, D He, et al.
ArXiv, 2024.
[ArXiv]
[HomePage]
A lightweight generalizable evaluation and enhancement framework for generative models and generated samples.
G Zhao, V Magoulianitis, S You, CCJ Kuo.
WACV, 2024.
[ArXiv]
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria.
W Ge, S Chen, et al.
ArXiv, 2023.
[ArXiv]
[HomePage]
Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark.
D Chen, R Chen, S Zhang, Y Liu, Y Wang, H Zhou, Q Zhang, P Zhou, Y Wan, L Sun.
arXiv:2402.04788, 2024.
[ArXiv]
[HomePage]
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Z Chen, Y Du, Z Wen, Y Zhou, C Cui, Z Weng, et al.
ArXiv, 2024.
[ArXiv]
[HomePage]
GenAI Arena: An Open Evaluation Platform for Generative Models.
D Jiang, M Ku, T Li, Y Ni, S Sun, R Fan, W Chen.
arXiv:2406.04485, 2024.
[ArXiv]
[HomePage]
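Arena-style platforms such as GenAI Arena (and WildVision earlier in this list) typically rank models with Elo-style ratings computed from pairwise human votes. A minimal sketch of the standard Elo update (the K-factor of 32 is an assumption, not a value taken from either platform):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise battle.
    score_a is 1.0 if model A wins the vote, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two models start at 1000 and A wins the vote
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```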
VLMEvalKit
Shanghai AI Lab
[Github]
lmms-eval
LMMs-Lab
[HomePage]
[Github]
Multi-Modality-Arena
OpenGVLab
[Github]
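Toolkits such as VLMEvalKit and lmms-eval wrap benchmark loading, model inference, and metric computation behind a common interface; see each repository's README for the actual CLI and supported models. As a purely illustrative sketch (every name below is hypothetical, not part of these toolkits' APIs), the core loop they automate looks roughly like:

```python
import json

def evaluate(model, benchmark_path: str) -> float:
    """Hypothetical harness loop: query a VLM on each (image, question) pair
    and report exact-match accuracy. `model` is any callable
    model(image_path, question) -> answer string."""
    with open(benchmark_path) as f:
        # assumed format: [{"image": ..., "question": ..., "answer": ...}, ...]
        samples = json.load(f)

    correct = 0
    for sample in samples:
        prediction = model(sample["image"], sample["question"])
        correct += prediction.strip().lower() == sample["answer"].strip().lower()
    return correct / len(samples)
```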
[MMStar] Are We on the Right Way for Evaluating Large Vision-Language Models?
L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen, H Duan, J Wang, Y Qiao, D Lin, F Zhao, et al.
ArXiv, 2024.
[ArXiv]
[Github]
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases.
AMH Tiong, J Zhao, B Li, J Li, SCH Hoi, et al.
ArXiv, 2024.
[ArXiv]
[Github]