Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
🤗 Hugging Face • PKU-KCL • 🤗 Demo: Hal-Evaluator • Paper
Please refer to our preprint paper: Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models.
Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations": inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations, but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this work, we introduce a refined taxonomy of hallucinations featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data covering the various hallucination types, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework, Hal-Eval. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations.
| Benchmark | Tasks: Dis | Tasks: Gen | Dis. Object | Dis. Attribute | Dis. Relation | Dis. Event | Gen. Object | Gen. Attribute | Gen. Relation | Gen. Event |
|---|---|---|---|---|---|---|---|---|---|---|
| POPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| NOPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CIEM | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| M-HalDetect | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| GAVIE | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ |
| FAITHScore | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| MMhal-Bench | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| HaELM | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| AMBER | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| Hal-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Existing multimodal hallucination research suffers from a lack of large-scale datasets with fine-grained, hallucination-specific annotations. To address this issue, we design AFHA, an Automatic Fine-grained Hallucination Annotation pipeline featuring annotations for four hallucination types and the specific hallucinated content. For more details, please refer to our paper.
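As a concrete illustration, below is a minimal sketch of what one fine-grained annotation produced by a pipeline like AFHA could look like. The field names are hypothetical and do not reflect the released schema.

```python
# A minimal, hypothetical sketch of a fine-grained hallucination annotation.
# Field names are illustrative only and are not the released Hal-Data schema.
afha_record = {
    "image_id": "COCO_val2014_000000123456",  # source image identifier
    "original_caption": "A man is riding a bicycle on a city street.",
    "hallucinated_caption": (
        "A man is riding a bicycle on a city street "
        "while a parade passes behind him."
    ),
    "hallucination_type": "event",  # one of: object, attribute, relation, event
    "hallucinated_content": "a parade passes behind him",
}
```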
Hal-Eval is divided into two distinct segments: Discriminative Evaluation and Generative Evaluation. We assess five widely used open-source LVLMs: MiniGPT-4, InstructBLIP, mPLUG-Owl, LLaVA, and LLaVA 1.5.
Our evaluation dataset is split into two parts. One part consists of in-domain evaluation data composed of image-text pairs from the COCO 2014 validation and COCO 2017 test sets. The other part is randomly sampled out-of-domain data sourced from web-based datasets such as CC, SBU, and LAION. We provide the evaluation data.
Please set the parameters in 5k_code.py and run the following command in the terminal:
CUDA_VISIBLE_DEVICES=0 python 5k_code.py
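For reference, the discriminative metrics reported below (Accuracy, Precision, Recall, F1, and the proportion of "yes" answers) can be computed from the collected yes/no responses roughly as follows. This is a minimal sketch that assumes "yes" is the positive class; it is not the script in 5k_code.py.

```python
from typing import List, Tuple

def discriminative_metrics(predictions: List[str], labels: List[str]) -> Tuple[float, ...]:
    """Compute Accuracy / Precision / Recall / F1 / Yes ratio from yes-no answers.

    Illustrative sketch only: assumes `predictions` and `labels` are lists of
    "yes"/"no" strings and that "yes" is treated as the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    yes_ratio = (tp + fp) / len(pairs)  # reported as a percentage in the tables
    return accuracy, precision, recall, f1, yes_ratio
```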
| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| In-domain | Object | mPLUG-Owl | 49.8 | 49.8 | 44.7 | 47.1 | 44.1 |
| In-domain | Object | LLaVA | 52.6 | 55.5 | 26.3 | 35.7 | 23.6 |
| In-domain | Object | MiniGPT-4 | 50.4 | 50.3 | 46.5 | 48.3 | 40.2 |
| In-domain | Object | InstructBLIP | 50.0 | 50.0 | 99.0 | 66.5 | 98.0 |
| In-domain | Object | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Attribute | mPLUG-Owl | 49.9 | 49.9 | 44.7 | 47.2 | 44.6 |
| In-domain | Attribute | LLaVA | 52.8 | 55.9 | 26.3 | 35.8 | 23.5 |
| In-domain | Attribute | MiniGPT-4 | 51.1 | 51.1 | 46.5 | 48.7 | 39.4 |
| In-domain | Attribute | InstructBLIP | 49.8 | 49.8 | 99.0 | 66.3 | 98.1 |
| In-domain | Attribute | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Relation | mPLUG-Owl | 50.4 | 50.5 | 44.7 | 47.4 | 44.7 |
| In-domain | Relation | LLaVA | 52.7 | 55.7 | 26.3 | 35.8 | 23.7 |
| In-domain | Relation | MiniGPT-4 | 50.4 | 50.1 | 46.5 | 48.2 | 40.0 |
| In-domain | Relation | InstructBLIP | 49.8 | 49.9 | 99.0 | 66.3 | 97.7 |
| In-domain | Relation | LLaVA 1.5 | 55.4 | 59.1 | 35.6 | 44.4 | 22.1 |
| In-domain | Event | mPLUG-Owl | 49.7 | 49.7 | 44.6 | 47.0 | 44.8 |
| In-domain | Event | LLaVA | 51.5 | 53.0 | 26.3 | 35.1 | 24.8 |
| In-domain | Event | MiniGPT-4 | 49.6 | 50.0 | 46.5 | 48.2 | 40.3 |
| In-domain | Event | InstructBLIP | 49.6 | 49.7 | 99.0 | 66.2 | 84.3 |
| In-domain | Event | LLaVA 1.5 | 62.7 | 77.9 | 35.6 | 48.9 | 22.8 |
| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| Out-of-domain | Object | mPLUG-Owl | 50.3 | 50.4 | 43.6 | 46.8 | 43.4 |
| Out-of-domain | Object | LLaVA | 50.7 | 52.7 | 9.0 | 15.3 | 7.2 |
| Out-of-domain | Object | MiniGPT-4 | 50.3 | 51.7 | 53.6 | 52.6 | 25.0 |
| Out-of-domain | Object | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Object | LLaVA 1.5 | 59.2 | 86.2 | 21.9 | 35.0 | 18.2 |
| Out-of-domain | Attribute | mPLUG-Owl | 50.4 | 50.5 | 43.6 | 46.8 | 42.9 |
| Out-of-domain | Attribute | LLaVA | 51.8 | 66.5 | 9.0 | 15.8 | 6.2 |
| Out-of-domain | Attribute | MiniGPT-4 | 50.0 | 51.5 | 53.6 | 52.6 | 24.7 |
| Out-of-domain | Attribute | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Attribute | LLaVA 1.5 | 58.1 | 79.4 | 21.9 | 34.4 | 13.8 |
| Out-of-domain | Relation | mPLUG-Owl | 50.0 | 50.0 | 43.6 | 46.6 | 43.1 |
| Out-of-domain | Relation | LLaVA | 50.8 | 57.1 | 9.0 | 15.5 | 7.8 |
| Out-of-domain | Relation | MiniGPT-4 | 49.7 | 50.9 | 53.6 | 52.2 | 24.6 |
| Out-of-domain | Relation | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Relation | LLaVA 1.5 | 53.7 | 60.2 | 21.9 | 32.2 | 12.7 |
| Out-of-domain | Event | mPLUG-Owl | 50.1 | 50.1 | 43.6 | 46.6 | 43.3 |
| Out-of-domain | Event | LLaVA | 46.2 | 31.2 | 9.0 | 14.0 | 13.2 |
| Out-of-domain | Event | MiniGPT-4 | 49.3 | 52.3 | 53.6 | 53.0 | 24.3 |
| Out-of-domain | Event | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 99.9 |
| Out-of-domain | Event | LLaVA 1.5 | 57.7 | 77.2 | 21.9 | 34.2 | 14.2 |
Regarding generative evaluation, current evaluation methods either rely on proprietary models that require subscription fees, such as GPT-4, or depend on fine-tuned large language models (LLMs) that require additional ground-truth annotations, a process that is prohibitively expensive. This significantly restricts the scalability of evaluation. In response, we propose Hal-Evaluator, a reference-free, open-source evaluation model designed specifically to detect hallucinatory content. Hal-Evaluator is fine-tuned from LLaVA 1.5, which is itself an LVLM. It takes as input the description of an image produced by the LVLM under evaluation, together with the corresponding image, and then evaluates whether the description contains hallucinations. If hallucinations are detected, it reports their specific content and category, and it can even revise the hallucinated information in the description to output an accurate depiction. In this way, our generative evaluation eliminates the need for additional reference annotations, enabling hallucination evaluation based solely on the content of the image.
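The interaction with Hal-Evaluator can be pictured as in the sketch below. The prompt wording and the structure of the parsed verdict are assumptions made for illustration, not the exact format used by the released model.

```python
# Illustrative sketch of querying Hal-Evaluator; the prompt wording and the
# verdict structure are assumptions, not the released model's exact format.
def build_eval_prompt(caption: str) -> str:
    # The image itself is fed to the model separately through its vision encoder.
    return (
        "Here is a description of the image:\n"
        f"{caption}\n"
        "Does the description contain hallucinations? If so, list the "
        "hallucinated content, its type (object / attribute / relation / event), "
        "and provide a corrected description."
    )

# The kind of structured verdict one could parse from the model's response:
example_verdict = {
    "has_hallucination": True,
    "hallucination_type": "event",
    "hallucinated_content": "a parade passes behind him",
    "corrected_description": "A man is riding a bicycle on a city street.",
}
```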
To train a Hal-Evaluator capable of effectively identifying different types of hallucinations, a large-scale, fine-grained hallucinatory image-text dataset is necessary, since such data is what allows the model to learn to detect and correct hallucinatory content. However, no dataset of this scale with detailed hallucination annotations currently exists. Therefore, we first constructed Hal-Data, the first large-scale, fine-grained dataset with hallucination annotations, based on the AFHA pipeline.
To maximize the diversity and comprehensiveness of our data, we initially compiled approximately 200K images from various sources, including 80K images from the in-domain COCO dataset and 80K web images from datasets such as CC, SBU, and LAION. Additionally, to better align with the style of LVLM outputs, we collected 40K image-text pairs from ShareGPT4-V. We then employed AFHA to annotate this data, resulting in a final collection of 130K instances meticulously annotated by GPT-4, which we name Hal-Data 130k. We release the dataset in Hal-Data.
Building upon the Hal-Data 130k dataset, we endeavored to further expand the scale of our dataset. Due to the high cost of using GPT-4, we leveraged Hal-Data 130k to fine-tune the open-source large language model LLaMA2 13B, resulting in a hallucination data annotation model named Hal-Annotator. Thanks to its training on diverse and comprehensive data, Hal-Annotator is capable of generating high-quality, content-relevant annotations, allowing the data scaling phase to proceed without the paid GPT-4. To accumulate a substantial volume of high-quality image-text pairs, we selected a subset of 2 million image-caption pairs from current public datasets and employed our pre-trained Hal-Annotator to modify the image captions by introducing different types of hallucinations and annotating them. We will release this dataset in the future.
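A minimal sketch of this scaling step with a fine-tuned causal LLM is shown below. The checkpoint path and the prompt are placeholders, and the actual Hal-Annotator inference code may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; the released Hal-Annotator weights may differ.
MODEL_PATH = "path/to/hal-annotator-llama2-13b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)

def annotate_caption(caption: str) -> str:
    """Ask the annotator to inject and label a hallucination (illustrative prompt)."""
    prompt = (
        "Rewrite the caption by inserting one hallucination, then annotate its "
        "type (object / attribute / relation / event) and content.\n"
        f"Caption: {caption}\nAnnotation:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the newly generated annotation.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```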
We have released a subset of our instruction data for Hal-Evaluator and will release the full instruction dataset in the future.
You need to prepare the model weights of Hal-Evaluator and run the Python file in generative_evaluation with a command like the following:
python eval_our_model_instructblip.py --model-path hal_eval_model_path --num-gpus 1 --qdir other_model_output_json --odir output_path_json
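To aggregate the evaluator's output into numbers in the format of the table below (per-type hallucination ratios and the share of hallucination-free captions, Acc), the output JSON could be summarized roughly as follows. The record fields used here are assumptions about the output file, not a documented schema.

```python
import json
from collections import Counter

def summarize(output_path: str):
    """Aggregate Hal-Evaluator verdicts into per-type ratios and accuracy.

    Assumes each record carries hypothetical fields `has_hallucination` (bool)
    and `hallucination_type` (str); the real output schema may differ.
    """
    with open(output_path) as f:
        records = json.load(f)

    hallucinated = [r for r in records if r["has_hallucination"]]
    acc = 1 - len(hallucinated) / len(records)  # fraction of hallucination-free captions
    type_counts = Counter(r["hallucination_type"] for r in hallucinated)
    ratios = {t: c / len(hallucinated) for t, c in type_counts.items()}
    return acc, ratios
```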
| Model | Length | In-domain Object Ratio | In-domain Relation Ratio | In-domain Attribute Ratio | In-domain Event Ratio | In-domain Acc | Out-of-domain Object Ratio | Out-of-domain Relation Ratio | Out-of-domain Attribute Ratio | Out-of-domain Event Ratio | Out-of-domain Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 28.7 | 36.6 | 30.6 | 16.5 | 10.6 | 69.3 | 45.5 | 20.8 | 19.2 | 14.6 | 66.5 |
| MiniGPT-4 | 79.6 | 46.2 | 22.5 | 8.0 | 23.4 | 61.4 | 53.7 | 9.7 | 7.2 | 29.6 | 50.1 |
| InstructBLIP | 10.3 | 34.2 | 45.2 | 10.3 | 8.3 | 89.1 | 47.6 | 27.4 | 13.2 | 10.2 | 91.0 |
| InstructBLIP | 80.6 | 25.7 | 12.6 | 16.8 | 51.3 | 35.5 | 19.6 | 11.4 | 15.2 | 59.3 | 41.3 |
| mPLUG-Owl | 28.3 | 45.5 | 24.6 | 16.3 | 13.4 | 45.4 | 40.5 | 21.2 | 18.5 | 19.4 | 43.5 |
| mPLUG-Owl | 78.3 | 46.2 | 9.5 | 12.5 | 31.7 | 27.3 | 45.9 | 9.3 | 4.6 | 40.2 | 29.5 |
| LLaVA | 37.3 | 40.1 | 18.5 | 4.5 | 37.1 | 47.4 | 34.9 | 23.2 | 24.4 | 17.8 | 46.3 |
| LLaVA | 88.3 | 45.7 | 9.4 | 3.1 | 42.1 | 23.3 | 38.3 | 7.2 | 2.2 | 52.6 | 26.3 |
| LLaVA 1.5 | 10.3 | 23.7 | 58.8 | 10.6 | 7.0 | 55.7 | 30.0 | 48.4 | 11.6 | 10.2 | 49.5 |
| LLaVA 1.5 | 84.5 | 42.2 | 13.0 | 3.6 | 41.4 | 44.6 | 34.6 | 8.8 | 2.7 | 54.3 | 46.4 |