
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

🤗 Hugging Face • 🌐 PKU-KCL • 🤖 Demo: Hal-Evaluator • ☕ Paper

Please refer to our preprint paper: Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models.

Introduction

Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations": inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations, but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this work, we introduce a refined taxonomy of hallucinations featuring a new category, Event Hallucination. We then use advanced LLMs to generate and filter fine-grained hallucinatory data covering the various hallucination types, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework, Hal-Eval. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging how well LVLMs handle hallucinations.

Comparison with Other Hallucination Benchmarks

Dis/Gen indicate whether a benchmark includes discriminative/generative tasks; the (Dis)/(Gen) columns mark which hallucination types each setting covers.

| Benchmark | Dis | Gen | Object (Dis) | Attribute (Dis) | Relation (Dis) | Event (Dis) | Object (Gen) | Attribute (Gen) | Relation (Gen) | Event (Gen) |
|---|---|---|---|---|---|---|---|---|---|---|
| POPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| NOPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CIEM | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| M-HalDetect | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| GAVIE | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ |
| FAITHScore | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| MMhal-Bench | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| HaELM | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| AMBER | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| Hal-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

AFHA: Automatic Fine-grained Hallucination Annotation Pipeline

Existing multimodal hallucination research suffers from a lack of large-scale datasets with fine-grained, hallucination-specific annotations. To address this issue, we design AFHA, an Automatic Fine-grained Hallucination Annotation pipeline featuring annotations for four hallucination types and the specific hallucinated content. For more details, please refer to our paper.
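To make the pipeline concrete, the sketch below shows what the core AFHA step could look like: prompting a strong LLM to inject exactly one typed hallucination into a clean caption and report what was changed. This is a minimal, hypothetical reconstruction, not the released code; the prompt wording, the `annotate` function, and the use of the `openai` client are our assumptions.

```python
# Hypothetical sketch of one AFHA annotation step (not the released pipeline).
import json
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

HALLUCINATION_TYPES = ["object", "attribute", "relation", "event"]

PROMPT = (
    "Rewrite the caption so that it contains exactly one {htype} hallucination "
    "(content not supported by the described scene). Return JSON with keys "
    '"hallucinated_caption", "hallucination_type", and "hallucinated_span".\n\n'
    "Caption: {caption}"
)

def annotate(caption: str, htype: str) -> dict:
    """Inject one hallucination of the given type and return the annotation."""
    assert htype in HALLUCINATION_TYPES
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT.format(htype=htype, caption=caption)}],
        temperature=0.7,
    )
    # A production pipeline would also validate and filter the output.
    return json.loads(resp.choices[0].message.content)
```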

Evaluation

Hal-Eval is divided into two distinct segments: Discriminative Evaluation and Generative Evaluation. We assess five widely used open-source LVLMs: MiniGPT-4, InstructBLIP, mPLUG-Owl, LLaVA, and LLaVA 1.5.

Discriminative Evaluation

Evaluation Dataset

Our evaluation dataset is split into two parts. One part consists of in-domain evaluation data, composed of image-text pairs from the COCO 2014 validation and COCO 2017 test sets. The other part provides out-of-domain data, sampled randomly from web-based datasets such as CC, SBU, and LAION. We provide the evaluation data.

Evaluation Script

Set the parameters in 5k_code.py and run the following command in a terminal:

CUDA_VISIBLE_DEVICES=0 python 5k_code.py
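The reported metrics follow directly from the model's yes/no answers. The snippet below is our own illustration of how Accuracy, Precision, Recall, F1, and the "Yes (%)" rate are typically computed for this kind of binary task; it is independent of 5k_code.py, and the record field names are hypothetical.

```python
# Minimal sketch of the discriminative metrics. Each sample carries a binary
# ground-truth label ("yes" = caption matches the image) and the LVLM's answer.
def discriminative_metrics(samples: list[dict]) -> dict:
    tp = fp = fn = tn = 0
    for s in samples:
        pred_yes = s["answer"].strip().lower().startswith("yes")  # hypothetical field
        gold_yes = s["label"] == "yes"                            # hypothetical field
        if pred_yes and gold_yes:
            tp += 1
        elif pred_yes:
            fp += 1
        elif gold_yes:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_rate": (tp + fp) / n,  # the "Yes (%)" column
    }
```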

Evaluation Results

| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| In-domain | Object | mPLUG-Owl | 49.8 | 49.8 | 44.7 | 47.1 | 44.1 |
| | | LLaVA | 52.6 | 55.5 | 26.3 | 35.7 | 23.6 |
| | | MiniGPT-4 | 50.4 | 50.3 | 46.5 | 48.3 | 40.2 |
| | | InstructBLIP | 50.0 | 50.0 | 99.0 | 66.5 | 98.0 |
| | | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Attribute | mPLUG-Owl | 49.9 | 49.9 | 44.7 | 47.2 | 44.6 |
| | | LLaVA | 52.8 | 55.9 | 26.3 | 35.8 | 23.5 |
| | | MiniGPT-4 | 51.1 | 51.1 | 46.5 | 48.7 | 39.4 |
| | | InstructBLIP | 49.8 | 49.8 | 99.0 | 66.3 | 98.1 |
| | | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Relation | mPLUG-Owl | 50.4 | 50.5 | 44.7 | 47.4 | 44.7 |
| | | LLaVA | 52.7 | 55.7 | 26.3 | 35.8 | 23.7 |
| | | MiniGPT-4 | 50.4 | 50.1 | 46.5 | 48.2 | 40.0 |
| | | InstructBLIP | 49.8 | 49.9 | 99.0 | 66.3 | 97.7 |
| | | LLaVA 1.5 | 55.4 | 59.1 | 35.6 | 44.4 | 22.1 |
| In-domain | Event | mPLUG-Owl | 49.7 | 49.7 | 44.6 | 47.0 | 44.8 |
| | | LLaVA | 51.5 | 53.0 | 26.3 | 35.1 | 24.8 |
| | | MiniGPT-4 | 49.6 | 50.0 | 46.5 | 48.2 | 40.3 |
| | | InstructBLIP | 49.6 | 49.7 | 99.0 | 66.2 | 84.3 |
| | | LLaVA 1.5 | 62.7 | 77.9 | 35.6 | 48.9 | 22.8 |

| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| Out-of-domain | Object | mPLUG-Owl | 50.3 | 50.4 | 43.6 | 46.8 | 43.4 |
| | | LLaVA | 50.7 | 52.7 | 9.0 | 15.3 | 7.2 |
| | | MiniGPT-4 | 50.3 | 51.7 | 53.6 | 52.6 | 25.0 |
| | | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| | | LLaVA 1.5 | 59.2 | 86.2 | 21.9 | 35.0 | 18.2 |
| Out-of-domain | Attribute | mPLUG-Owl | 50.4 | 50.5 | 43.6 | 46.8 | 42.9 |
| | | LLaVA | 51.8 | 66.5 | 9.0 | 15.8 | 6.2 |
| | | MiniGPT-4 | 50.0 | 51.5 | 53.6 | 52.6 | 24.7 |
| | | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| | | LLaVA 1.5 | 58.1 | 79.4 | 21.9 | 34.4 | 13.8 |
| Out-of-domain | Relation | mPLUG-Owl | 50.0 | 50.0 | 43.6 | 46.6 | 43.1 |
| | | LLaVA | 50.8 | 57.1 | 9.0 | 15.5 | 7.8 |
| | | MiniGPT-4 | 49.7 | 50.9 | 53.6 | 52.2 | 24.6 |
| | | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| | | LLaVA 1.5 | 53.7 | 60.2 | 21.9 | 32.2 | 12.7 |
| Out-of-domain | Event | mPLUG-Owl | 50.1 | 50.1 | 43.6 | 46.6 | 43.3 |
| | | LLaVA | 46.2 | 31.2 | 9.0 | 14.0 | 13.2 |
| | | MiniGPT-4 | 49.3 | 52.3 | 53.6 | 53.0 | 24.3 |
| | | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 99.9 |
| | | LLaVA 1.5 | 57.7 | 77.2 | 21.9 | 34.2 | 14.2 |

Generative Evaluation

Regarding generative evaluation, current methods either rely on proprietary models that require subscription fees, such as GPT-4, or depend on fine-tuned large language models (LLMs) that require additional ground-truth annotations, a process that is prohibitively expensive. This significantly restricts the scalability of model evaluation. In response, we propose Hal-Evaluator, a reference-free, open-source evaluation model designed specifically to detect hallucinatory content. Hal-Evaluator is fine-tuned from LLaVA 1.5, which is itself an LVLM. It takes as input the description of an image produced by the LVLM under evaluation, along with the corresponding image, and evaluates whether the description contains hallucinations. If hallucinations are detected, it reports their specific content and category, and can even rewrite the hallucinated parts of the description into an accurate depiction. In this way, our generative evaluation eliminates the need for additional reference annotations, enabling hallucination evaluation based solely on the content of the image.
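For intuition, a query to Hal-Evaluator pairs the image with an instruction wrapped around the caption under test. The prompt below is a hypothetical illustration of that shape; the actual instruction used in training may differ.

```python
# Hypothetical shape of a Hal-Evaluator query (actual prompt may differ).
EVAL_PROMPT = (
    "Given the image, judge whether the following description contains "
    "hallucinations. If it does, list the hallucinated content, classify each "
    "item as object/attribute/relation/event, and rewrite the description to "
    "be faithful to the image.\n\nDescription: {caption}"
)
```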

Training Hal-Evaluator to effectively identify different types of hallucinations requires a large-scale, fine-grained hallucinatory image-text dataset, since such data drives the training needed to detect and correct hallucinatory content. However, no dataset of this scale with detailed annotations currently exists. We therefore first constructed Hal-Data, the first large-scale fine-grained dataset with hallucination annotations, using the AFHA pipeline.

Hal-Data

Hal-Data 130K

To maximize the diversity and comprehensiveness of our data, we initially compiled approximately 200K images from various sources, including 80K images from the in-domain COCO dataset and 80K web images from sources such as CC, SBU, and LAION. Additionally, to better align with the style of LVLM outputs, we collected 40K image-text pairs from ShareGPT4-V. We then employed AFHA to annotate this data, resulting in a final collection of 130K instances meticulously annotated by GPT-4, which we name Hal-Data 130K. We release the dataset in Hal-Data.
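For illustration, a single Hal-Data record might look like the following; this is a hypothetical example, and the released dataset's exact field names and values may differ.

```python
# Hypothetical Hal-Data record (field names and values are illustrative only).
record = {
    "image": "coco/val2014/example.jpg",  # placeholder path
    "original_caption": "A man is riding a horse on the beach.",
    "hallucinated_caption": "A man is riding a horse on the beach "
                            "while a helicopter lands nearby.",
    "hallucination_type": "event",
    "hallucinated_content": "a helicopter lands nearby",
}
```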

Hal-Data 2M

Building upon Hal-Data 130K, we sought to further expand the dataset. Because of the high cost of GPT-4, we used Hal-Data 130K to fine-tune the open-source large language model LLaMA2 13B, obtaining a hallucination data annotation model named Hal-Annotator. Thanks to its training on diverse and comprehensive data, Hal-Annotator generates high-quality, content-related annotations, allowing the data-scaling phase to proceed without the paid GPT-4. To accumulate a substantial volume of high-quality image-text pairs, we selected a subset of 2 million image-caption pairs from public datasets and employed our pre-trained Hal-Annotator to modify the image captions by introducing different types of hallucinations and annotating them. We will release this dataset in the future.

Hal-Evaluator

As described above, Hal-Evaluator is fine-tuned from LLaVA 1.5, which is itself an LVLM. It takes as input the description of an image produced by the LVLM under evaluation, along with the corresponding image, and evaluates whether the description contains hallucinations. If hallucinations are detected, it reports their specific content and category, and can even rewrite the hallucinated parts of the description into an accurate depiction. We have released a subset of our instruction data for Hal-Evaluator and will release the full instruction dataset in the future.

Evaluation Script

Prepare the model weights of Hal-Evaluator and run the Python file in generative_evaluation with a command such as:

python eval_our_model_instructblip.py --model-path hal_eval_model_path --num-gpus 1 --qdir other_model_output_json --odir output_path_json
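The script writes Hal-Evaluator's per-description judgments to a JSON file. Below is a hedged sketch of how the table that follows could then be aggregated, assuming each record carries a `has_hallucination` flag and a `hallucination_type` (hypothetical field names) and reading "Acc" as the fraction of descriptions judged hallucination-free (our assumption).

```python
# Hedged sketch: aggregate Hal-Evaluator outputs into per-type ratios and Acc.
import json
from collections import Counter

def aggregate(path: str) -> dict:
    with open(path) as f:
        records = json.load(f)  # field names below are hypothetical
    types = Counter(
        r["hallucination_type"] for r in records if r["has_hallucination"]
    )
    n_halluc = sum(types.values())
    return {
        # share of each hallucination type among hallucinated descriptions
        "type_ratio": {t: c / n_halluc for t, c in types.items()} if n_halluc else {},
        # fraction of descriptions judged hallucination-free (our reading of "Acc")
        "acc": 1 - n_halluc / len(records),
    }
```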

Evaluation Results

(In) = In-domain, (Out) = Out-of-domain; the "Ratio" columns give the share of each hallucination type. Each model has two rows, one per description length.

| Model | Length | Object Ratio (In) | Relation Ratio (In) | Attribute Ratio (In) | Event Ratio (In) | Acc (In) | Object Ratio (Out) | Relation Ratio (Out) | Attribute Ratio (Out) | Event Ratio (Out) | Acc (Out) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 28.7 | 36.6 | 30.6 | 16.5 | 10.6 | 69.3 | 45.5 | 20.8 | 19.2 | 14.6 | 66.5 |
| | 79.6 | 46.2 | 22.5 | 8.0 | 23.4 | 61.4 | 53.7 | 9.7 | 7.2 | 29.6 | 50.1 |
| InstructBLIP | 10.3 | 34.2 | 45.2 | 10.3 | 8.3 | 89.1 | 47.6 | 27.4 | 13.2 | 10.2 | 91.0 |
| | 80.6 | 25.7 | 12.6 | 16.8 | 51.3 | 35.5 | 19.6 | 11.4 | 15.2 | 59.3 | 41.3 |
| mPLUG-Owl | 28.3 | 45.5 | 24.6 | 16.3 | 13.4 | 45.4 | 40.5 | 21.2 | 18.5 | 19.4 | 43.5 |
| | 78.3 | 46.2 | 9.5 | 12.5 | 31.7 | 27.3 | 45.9 | 9.3 | 4.6 | 40.2 | 29.5 |
| LLaVA | 37.3 | 40.1 | 18.5 | 4.5 | 37.1 | 47.4 | 34.9 | 23.2 | 24.4 | 17.8 | 46.3 |
| | 88.3 | 45.7 | 9.4 | 3.1 | 42.1 | 23.3 | 38.3 | 7.2 | 2.2 | 52.6 | 26.3 |
| LLaVA 1.5 | 10.3 | 23.7 | 58.8 | 10.6 | 7.0 | 55.7 | 30.0 | 48.4 | 11.6 | 10.2 | 49.5 |
| | 84.5 | 42.2 | 13.0 | 3.6 | 41.4 | 44.6 | 34.6 | 8.8 | 2.7 | 54.3 | 46.4 |
