Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
🤗 Hugging Face • PKU-KCL • 🤗 Demo: Hal-Evaluator • Paper
Please refer to our preprint paper: Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models.
Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations": inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations, but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this work, we introduce a refined taxonomy of hallucinations featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data covering the various hallucination types, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework, Hal-Eval. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations.
| Benchmark | Tasks: Dis | Tasks: Gen | Dis. Object | Dis. Attribute | Dis. Relation | Dis. Event | Gen. Object | Gen. Attribute | Gen. Relation | Gen. Event |
|---|---|---|---|---|---|---|---|---|---|---|
| POPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| NOPE | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CIEM | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| M-HalDetect | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| GAVIE | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ❌ | ❌ |
| FAITHScore | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ❌ |
| MMhal-Bench | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| HaELM | ❌ | ✔️ | ❌ | ❌ | ❌ | - | - | - | - | ❌ |
| AMBER | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| Hal-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Existing multimodal hallucination research suffers from a lack of large-scale datasets with fine-grained, hallucination-specific annotations. To address this issue, we design AFHA, an Automatic Fine-grained Hallucination Annotation pipeline featuring annotations for four hallucination types and the specific hallucinated content. For more details, please refer to our paper.
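As a concrete illustration, below is a minimal sketch of what one fine-grained annotation produced by a pipeline like AFHA could look like. The field names are hypothetical and do not reflect the released schema.

```python
# A minimal, hypothetical sketch of a fine-grained hallucination annotation.
# Field names are illustrative only and are not the released Hal-Data schema.
afha_record = {
    "image_id": "COCO_val2014_000000123456",  # source image identifier
    "original_caption": "A man is riding a bicycle on a city street.",
    "hallucinated_caption": (
        "A man is riding a bicycle on a city street "
        "while a parade passes behind him."
    ),
    "hallucination_type": "event",  # one of: object, attribute, relation, event
    "hallucinated_content": "a parade passes behind him",
}
```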
Hal-Eval is divided into two distinct segments: Discriminative Evaluation and Generative Evaluation. We assess five widely used open-source LVLMs: MiniGPT-4, InstructBLIP, mPLUG-Owl, LLaVA, and LLaVA 1.5.
Our evaluation dataset is split into two parts. One part consists of in-domain evaluation data composed of image-text pairs from the COCO 2014 validation and COCO 2017 test sets. The other part is randomly sampled out-of-domain data sourced from web-based datasets such as CC, SBU, and LAION. We provide the evaluation data.
Please set the parameters in 5k_code.py and run the following command in the terminal:
CUDA_VISIBLE_DEVICES=0 python 5k_code.py
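For reference, the discriminative metrics reported below (Accuracy, Precision, Recall, F1, and the proportion of "yes" answers) can be computed from the collected yes/no responses roughly as follows. This is a minimal sketch that assumes "yes" is the positive class; it is not the script in 5k_code.py.

```python
from typing import List, Tuple

def discriminative_metrics(predictions: List[str], labels: List[str]) -> Tuple[float, ...]:
    """Compute Accuracy / Precision / Recall / F1 / Yes ratio from yes-no answers.

    Illustrative sketch only: assumes `predictions` and `labels` are lists of
    "yes"/"no" strings and that "yes" is treated as the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    yes_ratio = (tp + fp) / len(pairs)  # reported as a percentage in the tables
    return accuracy, precision, recall, f1, yes_ratio
```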
| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| In-domain | Object | mPLUG-Owl | 49.8 | 49.8 | 44.7 | 47.1 | 44.1 |
| In-domain | Object | LLaVA | 52.6 | 55.5 | 26.3 | 35.7 | 23.6 |
| In-domain | Object | MiniGPT-4 | 50.4 | 50.3 | 46.5 | 48.3 | 40.2 |
| In-domain | Object | InstructBLIP | 50.0 | 50.0 | 99.0 | 66.5 | 98.0 |
| In-domain | Object | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Attribute | mPLUG-Owl | 49.9 | 49.9 | 44.7 | 47.2 | 44.6 |
| In-domain | Attribute | LLaVA | 52.8 | 55.9 | 26.3 | 35.8 | 23.5 |
| In-domain | Attribute | MiniGPT-4 | 51.1 | 51.1 | 46.5 | 48.7 | 39.4 |
| In-domain | Attribute | InstructBLIP | 49.8 | 49.8 | 99.0 | 66.3 | 98.1 |
| In-domain | Attribute | LLaVA 1.5 | 62.2 | 76.1 | 35.6 | 48.5 | 23.3 |
| In-domain | Relation | mPLUG-Owl | 50.4 | 50.5 | 44.7 | 47.4 | 44.7 |
| In-domain | Relation | LLaVA | 52.7 | 55.7 | 26.3 | 35.8 | 23.7 |
| In-domain | Relation | MiniGPT-4 | 50.4 | 50.1 | 46.5 | 48.2 | 40.0 |
| In-domain | Relation | InstructBLIP | 49.8 | 49.9 | 99.0 | 66.3 | 97.7 |
| In-domain | Relation | LLaVA 1.5 | 55.4 | 59.1 | 35.6 | 44.4 | 22.1 |
| In-domain | Event | mPLUG-Owl | 49.7 | 49.7 | 44.6 | 47.0 | 44.8 |
| In-domain | Event | LLaVA | 51.5 | 53.0 | 26.3 | 35.1 | 24.8 |
| In-domain | Event | MiniGPT-4 | 49.6 | 50.0 | 46.5 | 48.2 | 40.3 |
| In-domain | Event | InstructBLIP | 49.6 | 49.7 | 99.0 | 66.2 | 84.3 |
| In-domain | Event | LLaVA 1.5 | 62.7 | 77.9 | 35.6 | 48.9 | 22.8 |
| Dataset | Type | Model | Accuracy | Precision | Recall | F1 | Yes (%) |
|---|---|---|---|---|---|---|---|
| Out-of-domain | Object | mPLUG-Owl | 50.3 | 50.4 | 43.6 | 46.8 | 43.4 |
| Out-of-domain | Object | LLaVA | 50.7 | 52.7 | 9.0 | 15.3 | 7.2 |
| Out-of-domain | Object | MiniGPT-4 | 50.3 | 51.7 | 53.6 | 52.6 | 25.0 |
| Out-of-domain | Object | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Object | LLaVA 1.5 | 59.2 | 86.2 | 21.9 | 35.0 | 18.2 |
| Out-of-domain | Attribute | mPLUG-Owl | 50.4 | 50.5 | 43.6 | 46.8 | 42.9 |
| Out-of-domain | Attribute | LLaVA | 51.8 | 66.5 | 9.0 | 15.8 | 6.2 |
| Out-of-domain | Attribute | MiniGPT-4 | 50.0 | 51.5 | 53.6 | 52.6 | 24.7 |
| Out-of-domain | Attribute | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Attribute | LLaVA 1.5 | 58.1 | 79.4 | 21.9 | 34.4 | 13.8 |
| Out-of-domain | Relation | mPLUG-Owl | 50.0 | 50.0 | 43.6 | 46.6 | 43.1 |
| Out-of-domain | Relation | LLaVA | 50.8 | 57.1 | 9.0 | 15.5 | 7.8 |
| Out-of-domain | Relation | MiniGPT-4 | 49.7 | 50.9 | 53.6 | 52.2 | 24.6 |
| Out-of-domain | Relation | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 100.0 |
| Out-of-domain | Relation | LLaVA 1.5 | 53.7 | 60.2 | 21.9 | 32.2 | 12.7 |
| Out-of-domain | Event | mPLUG-Owl | 50.1 | 50.1 | 43.6 | 46.6 | 43.3 |
| Out-of-domain | Event | LLaVA | 46.2 | 31.2 | 9.0 | 14.0 | 13.2 |
| Out-of-domain | Event | MiniGPT-4 | 49.3 | 52.3 | 53.6 | 53.0 | 24.3 |
| Out-of-domain | Event | InstructBLIP | 50.0 | 50.0 | 100.0 | 66.6 | 99.9 |
| Out-of-domain | Event | LLaVA 1.5 | 57.7 | 77.2 | 21.9 | 34.2 | 14.2 |
Regarding generative evaluation, current evaluation methods either rely on proprietary models that require subscription fees, such as GPT-4, or depend on fine-tuned large language models (LLMs) that require additional ground-truth annotations, a process that is prohibitively expensive. This significantly restricts the scalability of evaluation. In response, we propose Hal-Evaluator, a reference-free, open-source evaluation model designed specifically to detect hallucinatory content. Hal-Evaluator is fine-tuned from LLaVA 1.5, which is itself an LVLM. It takes as input the description of an image produced by the LVLM under evaluation, together with the corresponding image, and then evaluates whether the description contains hallucinations. If hallucinations are detected, it reports their specific content and category, and it can even revise the hallucinated information in the description to output an accurate depiction. In this way, our generative evaluation eliminates the need for additional reference annotations, enabling hallucination evaluation based solely on the content of the image.
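The interaction with Hal-Evaluator can be pictured as in the sketch below. The prompt wording and the structure of the parsed verdict are assumptions made for illustration, not the exact format used by the released model.

```python
# Illustrative sketch of querying Hal-Evaluator; the prompt wording and the
# verdict structure are assumptions, not the released model's exact format.
def build_eval_prompt(caption: str) -> str:
    # The image itself is fed to the model separately through its vision encoder.
    return (
        "Here is a description of the image:\n"
        f"{caption}\n"
        "Does the description contain hallucinations? If so, list the "
        "hallucinated content, its type (object / attribute / relation / event), "
        "and provide a corrected description."
    )

# The kind of structured verdict one could parse from the model's response:
example_verdict = {
    "has_hallucination": True,
    "hallucination_type": "event",
    "hallucinated_content": "a parade passes behind him",
    "corrected_description": "A man is riding a bicycle on a city street.",
}
```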
To train a Hal-Evaluator capable of effectively identifying different types of hallucinations, a large-scale, fine-grained hallucinatory image-text dataset is necessary, since such data is what allows the model to learn to detect and correct hallucinatory content. However, no dataset of this scale with detailed hallucination annotations currently exists. Therefore, we first constructed Hal-Data, the first large-scale, fine-grained dataset with hallucination annotations, based on the AFHA pipeline.
To maximize the diversity and comprehensiveness of our data, we initially compiled approximately 200K images from various sources, including 80K images from the in-domain COCO dataset and 80K web images from datasets such as CC, SBU, and LAION. Additionally, to better align with the style of LVLM outputs, we collected 40K image-text pairs from ShareGPT4-V. We then employed AFHA to annotate this data, resulting in a final collection of 130K instances meticulously annotated by GPT-4, which we name Hal-Data 130k. We release the dataset in Hal-Data.
Building upon the Hal-Data 130k dataset, we endeavored to further expand the scale of our dataset. Due to the high cost of using GPT-4, we leveraged Hal-Data 130k to fine-tune the open-source large language model LLaMA2 13B, resulting in a hallucination data annotation model named Hal-Annotator. Thanks to its training on diverse and comprehensive data, Hal-Annotator is capable of generating high-quality, content-relevant annotations, allowing the data scaling phase to proceed without the paid GPT-4. To accumulate a substantial volume of high-quality image-text pairs, we selected a subset of 2 million image-caption pairs from current public datasets and employed our pre-trained Hal-Annotator to modify the image captions by introducing different types of hallucinations and annotating them. We will release this dataset in the future.
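A minimal sketch of this scaling step with a fine-tuned causal LLM is shown below. The checkpoint path and the prompt are placeholders, and the actual Hal-Annotator inference code may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; the released Hal-Annotator weights may differ.
MODEL_PATH = "path/to/hal-annotator-llama2-13b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)

def annotate_caption(caption: str) -> str:
    """Ask the annotator to inject and label a hallucination (illustrative prompt)."""
    prompt = (
        "Rewrite the caption by inserting one hallucination, then annotate its "
        "type (object / attribute / relation / event) and content.\n"
        f"Caption: {caption}\nAnnotation:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the newly generated annotation.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```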
We have released a subset of our instruction data for Hal-Evaluator and will release the full instruction dataset in the future.
You need to prepare the model weights of Hal-Evaluator and run the Python file in generative_evaluation with a command like the following:
python eval_our_model_instructblip.py --model-path hal_eval_model_path --num-gpus 1 --qdir other_model_output_json --odir output_path_json
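To aggregate the evaluator's output into numbers in the format of the table below (per-type hallucination ratios and the share of hallucination-free captions, Acc), the output JSON could be summarized roughly as follows. The record fields used here are assumptions about the output file, not a documented schema.

```python
import json
from collections import Counter

def summarize(output_path: str):
    """Aggregate Hal-Evaluator verdicts into per-type ratios and accuracy.

    Assumes each record carries hypothetical fields `has_hallucination` (bool)
    and `hallucination_type` (str); the real output schema may differ.
    """
    with open(output_path) as f:
        records = json.load(f)

    hallucinated = [r for r in records if r["has_hallucination"]]
    acc = 1 - len(hallucinated) / len(records)  # fraction of hallucination-free captions
    type_counts = Counter(r["hallucination_type"] for r in hallucinated)
    ratios = {t: c / len(hallucinated) for t, c in type_counts.items()}
    return acc, ratios
```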
| Model | Length | In-domain Object Ratio | In-domain Relation Ratio | In-domain Attribute Ratio | In-domain Event Ratio | In-domain Acc | Out-of-domain Object Ratio | Out-of-domain Relation Ratio | Out-of-domain Attribute Ratio | Out-of-domain Event Ratio | Out-of-domain Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 28.7 | 36.6 | 30.6 | 16.5 | 10.6 | 69.3 | 45.5 | 20.8 | 19.2 | 14.6 | 66.5 |
| MiniGPT-4 | 79.6 | 46.2 | 22.5 | 8.0 | 23.4 | 61.4 | 53.7 | 9.7 | 7.2 | 29.6 | 50.1 |
| InstructBLIP | 10.3 | 34.2 | 45.2 | 10.3 | 8.3 | 89.1 | 47.6 | 27.4 | 13.2 | 10.2 | 91.0 |
| InstructBLIP | 80.6 | 25.7 | 12.6 | 16.8 | 51.3 | 35.5 | 19.6 | 11.4 | 15.2 | 59.3 | 41.3 |
| mPLUG-Owl | 28.3 | 45.5 | 24.6 | 16.3 | 13.4 | 45.4 | 40.5 | 21.2 | 18.5 | 19.4 | 43.5 |
| mPLUG-Owl | 78.3 | 46.2 | 9.5 | 12.5 | 31.7 | 27.3 | 45.9 | 9.3 | 4.6 | 40.2 | 29.5 |
| LLaVA | 37.3 | 40.1 | 18.5 | 4.5 | 37.1 | 47.4 | 34.9 | 23.2 | 24.4 | 17.8 | 46.3 |
| LLaVA | 88.3 | 45.7 | 9.4 | 3.1 | 42.1 | 23.3 | 38.3 | 7.2 | 2.2 | 52.6 | 26.3 |
| LLaVA 1.5 | 10.3 | 23.7 | 58.8 | 10.6 | 7.0 | 55.7 | 30.0 | 48.4 | 11.6 | 10.2 | 49.5 |
| LLaVA 1.5 | 84.5 | 42.2 | 13.0 | 3.6 | 41.4 | 44.6 | 34.6 | 8.8 | 2.7 | 54.3 | 46.4 |