MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
[Project Page] [Paper] [Huggingface Dataset] [Leaderboard]
MMDocBench is an open-sourced benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities.
This repository contains the evaluation code and instructions for the paper MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding.
For more details, please refer to the project page: https://MMDocBench.github.io/.
MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs and 11,353 supporting regions (i.e., bounding boxes). Each QA pair corresponds to one or more supporting regions, marked with red dotted-line rectangles on the image.
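To make the annotation structure concrete, below is a purely illustrative sketch of a single QA pair with its supporting regions. The field names, task names, values, and box format are hypothetical; consult the released dataset for the actual schema.

```python
# Illustrative only: a hypothetical MMDocBench-style record.
# Field names and the bounding-box format are assumptions, not the real schema.
example_record = {
    "task": "<main task name>",            # one of the 15 main tasks
    "sub_task": "<sub-task name>",         # one of the 48 sub-tasks
    "question": "What is the total amount shown on the receipt?",
    "answer": "128.50",
    "supporting_regions": [
        [320, 845, 512, 890],              # assumed [x1, y1, x2, y2] pixel box
    ],
}
```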
[2024.10.10] Our paper, benchmark and project page are publicly available!
For a quick installation of the environment, please run the following script:
git clone https://github.com/fengbin/MMDocBench.git
cd MMDocBench
pip install -e .
For more information, please refer to the Quick Start from VLMEvalKit to set up the environment and prepare the evaluation script.
The dataset can be downloaded from here.
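After downloading, a quick sanity check is to load the TSV with pandas and inspect its columns. This is only a sketch; the exact filename and column names depend on the released data.

```python
import pandas as pd

# Assumed filename; substitute the actual TSV name from the download.
df = pd.read_csv("MMDocBench.tsv", sep="\t")

print(df.shape)               # number of rows (QA pairs) x number of columns
print(df.columns.tolist())    # inspect the available fields
```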
This repo supports the evaluation of the following models with our benchmark:
Closed-source Models
- GPT-4o
- GPT-4V
- Qwen-VL-Max
Open-source Models
- InternVL2-Llama3-76B
- LLaVA-OV-Chat-72B
- mPLUG-DocOwl1.5-Omni
- LLaVA-V1.6-34B
- Yi-VL-34B
- CogVLM2-Chat-19B
- Qwen2-VL-7B-Instruct
- TextMonkey
- MiniCPM-V2.6
- MiniCPM-Llama3-V2.5
- InternVL2-8B
- mPLUG-Owl3
- Ferret
To ensure a smooth evaluation process, make sure to install the specific version of the Transformers library as specified in the repository for the model you wish to evaluate.
Some models were not supported by VLMEvalKit at the time of our benchmark development.
For these models, we used the following Transformers versions:
- mPLUG-DocOwl1.5-Omni: transformers=4.33
- TextMonkey: transformers=4.33
- Ferret: transformers=4.40
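Before launching an evaluation, you can confirm that the installed Transformers version matches the required one. This is a minimal check, not part of the official tooling:

```python
import transformers

# Required version for the model you plan to evaluate,
# e.g. 4.33 for mPLUG-DocOwl1.5-Omni and TextMonkey.
required = "4.33"
installed = transformers.__version__
assert installed.startswith(required), (
    f"Expected transformers {required}.x, found {installed}"
)
print(f"transformers {installed} OK")
```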
The model configuration file can be found at MMDocBench/vlmeval/config.py, and model paths should be specified there. Additionally, we set each model's max_new_tokens parameter to 4096 in the config file, as some tasks in our benchmark require long prediction outputs. For some models such as llava and mplug_owl3, whose generation configuration is fixed, we modify the model files to support the max_new_tokens parameter.
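For illustration, a model entry in the config file roughly follows the general VLMEvalKit pattern sketched below. The class name, import path, and keyword arguments are indicative only and should be checked against MMDocBench/vlmeval/config.py.

```python
from functools import partial

# Illustrative only: the class name and accepted kwargs may differ
# in the actual config.py.
from vlmeval.vlm import Qwen2VLChat

supported_VLM = {
    # Point model_path at a local checkpoint or a Huggingface model id,
    # and raise max_new_tokens so long predictions are not truncated.
    "Qwen2-VL-7B-Instruct": partial(
        Qwen2VLChat,
        model_path="/MODEL/PATH/Qwen2-VL-7B-Instruct",
        max_new_tokens=4096,
    ),
}
```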
Run the evaluation with either python or torchrun as follows to evaluate one model with MMDocBench.
- python
LMUData=/PATH/TO/TSV/DATA/ python run.py --data MMDocBench --model /MODEL/PATH/ --mode all --work-dir /PATH/TO/WORK_DIR/ --verbose
- torchrun
LMUData=/PATH/TO/TSV/DATA/ torchrun --nproc-per-node=8 run.py --data MMDocBench --model /MODEL/PATH/ --mode all --work-dir /PATH/TO/WORK_DIR/ --verbose
Please refer to the VLMEvalKit repository to add the new model you wish to evaluate with MMDocBench.
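As a rough sketch of what adding a model involves, a custom model in VLMEvalKit is typically a wrapper class that turns an interleaved image/text message into a text prediction. The base class, attribute names, and method signature below follow the general VLMEvalKit pattern but should be verified against its documentation; the generation helper is hypothetical.

```python
from vlmeval.vlm.base import BaseModel


class MyNewModel(BaseModel):
    """Illustrative skeleton of a custom VLMEvalKit model wrapper."""

    INTERLEAVE = True  # whether the model accepts interleaved image/text input

    def __init__(self, model_path, max_new_tokens=4096, **kwargs):
        self.model_path = model_path
        self.max_new_tokens = max_new_tokens
        # Load your model and tokenizer/processor here.

    def generate_inner(self, message, dataset=None):
        # `message` is a list of {'type': 'image'|'text', 'value': ...} items.
        images = [m["value"] for m in message if m["type"] == "image"]
        prompt = "\n".join(m["value"] for m in message if m["type"] == "text")
        # Run inference and return the predicted string, including the
        # bounding-box output required by MMDocBench tasks.
        return self.my_model_generate(images, prompt)  # hypothetical helper
```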
@misc{zhu2024mmdocbench,
title={MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding},
author={Fengbin Zhu and Ziyang Liu and Xiang Yao Ng and Haohui Wu and Wenjie Wang and Fuli Feng and Chao Wang and Huanbo Luan and Tat Seng Chua},
year={2024},
eprint={2410.21311},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.21311},
}
The benchmark is distributed under the CC BY-SA 4.0 license, and the code is licensed under Apache 2.0, following VLMEvalKit.