Multimodal large language models (MLLMs) offer a powerful mechanism for interpreting visual information, but they often suffer from hallucinations, which impede their real-world use. Existing methods attempt to alleviate this issue by designing special decoding strategies that penalize summary tokens, yet they lack an analysis of the relationship between hallucination and the summarization mechanism of LLMs. Interestingly, we find that penalizing summary tokens is not necessary: merely intervening on the variance of the query-key parameters, at no extra inference cost, still alleviates hallucinations. Specifically, we explore the causes of hallucinations by analyzing localized self-attention patterns called "anchor" tokens and define the model's degree of attention localization as token propagation probabilities. Our analysis reveals that over-propagation of anchor tokens occurs when the eigenvalue distribution of the query and key matrices has a non-zero mean and a polarized variance, leading to excessive dependence on anchor tokens while neglecting visual information, and hence to hallucinated descriptions of the image content. Based on this observation, we propose a versatile plug-and-play decoding strategy, the Dynamic Token Propagation Mechanism (TAME), which alleviates excessive propagation by dynamically intervening on the eigenspectrum variance of the attention weights, thereby mitigating hallucinations without relying on complex decoding strategies. Extensive experiments reveal a correlation between the eigenspectrum and hallucinations across various MLLMs and show that TAME reduces the proportion of hallucinated objects.
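As a rough illustration of the idea (not the exact TAME implementation: the function name, the `alpha` factor, and the use of SVD as a stand-in for the eigendecomposition are all assumptions), the intervention amounts to shrinking the spread of the attention-score spectrum before the softmax:

```python
# Rough sketch only: damp the spectrum variance of a pre-softmax
# attention-score matrix. `alpha` and the SVD-based decomposition are
# illustrative assumptions, not the exact TAME implementation.
import torch

def damp_spectrum_variance(scores: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """scores: (seq_len, seq_len) pre-softmax scores, e.g. Q @ K.T / sqrt(d)."""
    # SVD stands in for the eigendecomposition of the (non-symmetric) score matrix.
    U, S, Vh = torch.linalg.svd(scores, full_matrices=False)
    # Pull every singular value toward the mean, reducing the polarized
    # variance that lets a few "anchor" directions dominate.
    S_damped = S.mean() + alpha * (S - S.mean())
    return U @ torch.diag(S_damped) @ Vh

# Example: q, k of shape (seq_len, head_dim)
q, k = torch.randn(16, 64), torch.randn(16, 64)
scores = damp_spectrum_variance(q @ k.T / 64 ** 0.5)
attn = scores.softmax(dim=-1)
```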
Since our method modifies the LVLM decoding strategy, the easiest way to use ANTRP is to install our modified transformers package:
conda env create -f environment.yml
conda activate ANTRP
python -m pip install -e transformers
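To confirm that the editable install of the modified package is the one your environment imports, you can check its install path (this check is optional and just prints the package location):

python -c "import transformers; print(transformers.__file__)"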
After setting up the environment, you can run our codebase directly to apply ANTRP:
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --beam 5 --opera #OPERA
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-cd --use-fast-v --sample --sample-greedy #SID_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-vcd --sample --sample-greedy #VCD_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-icd --sample --sample-greedy #ICD_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --beam 5 #Beam Search
CHAIR evaluation uses the same argument configuration as the POPE commands above.
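For example, a CHAIR run with beam search might look like the following (the flags are assumed to mirror `pope_eval.py`; adjust paths and decoding options as needed):

python eval_utils/chair_eval.py --model llava-1.5 --data-path /path/to/COCO2014 --beam 5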
We provide extensive evaluation metrics, including:
- GPT-4V: `eval_utils/gpt4v_eval.py`
- GPT-4 (SHR): `shr_eval.py`
- POPE: `pope_eval.py`
- CHAIR: `eval_utils/chair_eval.py`
The following evaluations require the MSCOCO 2014 and/or Visual Genome datasets. For Visual Genome, download the dataset with `dataset/download_visual_genome_v1.2.py` and extract it into the data path.
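For example, the download script can be invoked as below (any arguments the script accepts for choosing the download location are not shown here):

python dataset/download_visual_genome_v1.2.py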
In addition, you need to prepare the following checkpoints of the 7B base models:
- Download the LLaVA-1.5 merged 7B model and specify it at `eval_configs/llava-1.5_eval.yaml`.
- Download the Vicuna 7B v1.1 model and specify it at `minigpt4/configs/models/blip2_instruct_vicuna7b.yaml`.
- Download the Shikra merged 7B model and specify it at `eval_configs/shikra_eval.yaml`.
- Download the MiniGPT-4 7B pretrained weights and specify them at Line 8 of `eval_configs/minigpt4_eval.yaml`.
Argument | Example | Description
---|---|---
`--model` | `llava-1.5` | Specify the LVLM model.
`--data-path` | `/path/to/dataset` | Path to the dataset file or folder.
`--pope-type` | `coco_adversarial` | Type of POPE evaluation.
`--sample` | `store_true` | Use the modified decoding strategy.
`--sample-greedy` | `store_true` | Use CD with sampling and greedy decoding.
`--beam` | `5` | Beam search width.
`--opera` | `store_true` | Use OPERA.
This repo builds on the LVLM codebases of SID, OPERA, VCD, and HA-DPO. Thanks for their excellent work!