Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha ✉️
- Clone the repository:
  git clone https://github.com/LuFan31/CompreCap.git
  cd CompreCap
- Build the environment:
  conda create -n CompreCap python=3.9
  conda activate CompreCap
  pip install -r requirements.txt
  pip install spacy
  python -m spacy download en_core_web_lg
- Download the weights of the parser and LLM evaluator: download the Sentence-Transformers model (all-MiniLM-L6-v2) and Llama3 (Meta-Llama-3-8B-Instruct) from Hugging Face.
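For example, here is a minimal sketch using the huggingface_hub package (the local directory names are placeholders, and the Llama3 repository is gated, so you may need to log in with an access token first):

```python
# Sketch: download the evaluator weights from Hugging Face with huggingface_hub.
# The local_dir values are placeholders; point --llm and --bert to wherever you save them.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="weights/sentence-transformers--all-MiniLM-L6-v2",
)
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="weights/Meta-Llama-3-8B-Instruct",
)
```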
Our CompreCap benchmark is available at 🤗CompreCap and is organized as follows:
├── CompreCap_dataset
│   ├── images
│   │   ├── 000000000802.jpg
│   │   └── ...
│   ├── QA_json
│   │   ├── finegri_desc_qa.jsonl
│   │   └── finegri_visible_qa_hulla.jsonl
│   └── anno.json
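After downloading, a quick hypothetical sanity check can confirm that the expected layout is in place; it only touches the paths shown in the tree above and does not assume anything about the schema of anno.json:

```python
# Hypothetical sanity check for the downloaded CompreCap_dataset layout.
import json
import os

data_root = "CompreCap_dataset"
for rel_path in ["images", "QA_json/finegri_desc_qa.jsonl", "anno.json"]:
    assert os.path.exists(os.path.join(data_root, rel_path)), f"missing {rel_path}"

with open(os.path.join(data_root, "anno.json"), "r") as f:
    anno = json.load(f)
print(f"anno.json loaded: {type(anno).__name__} with {len(anno)} top-level entries")
```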
We prompt 10 popular LVLMs to generate detailed captions for the images in the images folder. You can run the following command to assess the detailed captions, using Llama3 as the evaluator; the results will be saved to the path specified by --output-dir:
python evaluate.py \
--data_root CompreCap_dataset/ \
--anno CompreCap_dataset/anno.json \
--soft_coverage \
--llm PATH/to/Meta-Llama-3-8B-Instruct \
--bert PATH/to/sentence-transformers--all-MiniLM-L6-v2 \
--longcap_file longcap_from_MLLM/LLava-Next-34B.json \
--output-dir output/LLava-Next-34B_result.json
The file storing the captions to be evaluated is specified by --longcap_file, and its data is saved in the following format:
{
    "image_name1": "caption1",
    "image_name2": "caption2",
    ...
}
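A minimal sketch for producing such a file is shown below; generate_caption is a hypothetical stand-in for your LVLM's inference call, and the output path is only an example:

```python
# Sketch: collect captions from your LVLM into the --longcap_file format,
# i.e. a JSON object mapping each image file name to its caption.
import json
import os


def generate_caption(image_path: str) -> str:
    """Placeholder: replace with your model's caption-generation call."""
    raise NotImplementedError


image_dir = "CompreCap_dataset/images"
captions = {name: generate_caption(os.path.join(image_dir, name))
            for name in sorted(os.listdir(image_dir))}

with open("longcap_from_MLLM/my_model.json", "w") as f:
    json.dump(captions, f, indent=4)
```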
The fine-grained object VQA consists of CompreQA-P and CompreQA-Cap. The question-answer pairs are stored in the .jsonl files under CompreCap_dataset/QA_json. You can evaluate LVLMs' performance on fine-grained object VQA with the following code example:
import json
import os
import re

from PIL import Image

# `process_images`, `multimodal_model`, and `tokenizer` are placeholders for the
# image preprocessor, model, and tokenizer of the LVLM under evaluation; replace
# them with the corresponding objects from your model's codebase.

image_dir = 'CompreCap_dataset/images'
option_letter_regex = re.compile(r"^(A|B|C)$", re.IGNORECASE)

with open('result.jsonl', 'w') as outfile:
    with open('CompreCap_dataset/QA_json/finegri_desc_qa.jsonl', 'r') as infile:  # use finegri_visible_qa.jsonl to evaluate CompreQA-P
        correct, num_total = 0, 0
        for line in infile:
            qa_line = json.loads(line)
            image_name = qa_line['image_name']
            object_category = qa_line['category']
            question = qa_line['question'] + " Just answer with the option's letter"
            answer = qa_line['answer']

            image_path = os.path.join(image_dir, image_name)
            image = Image.open(image_path).convert('RGB')
            image_tensor = process_images(image).cuda()
            response = multimodal_model.generate(tokenizer, image_tensor, question)

            # Normalize the response to a single option letter.
            if option_letter_regex.match(response):
                response = response.upper()
            else:
                response = response.split('.')[0]
                if option_letter_regex.match(response):
                    response = response.upper()
                else:
                    print('Could not determine A or B or C.')
                    print(response)

            num_total += 1
            if response == answer:
                correct += 1
                is_correct = 'Yes!'
            else:
                is_correct = 'No!'

            # Recover the option descriptions from the question text.
            if response != 'D':
                prediction = question.split(f'{response}. ')[-1].split('.')[0]
            else:
                prediction = f"The '{object_category}' is not visible!"
            answer_desc = question.split(f'{answer}. ')[-1].split('.')[0]

            result_line = {'image_name': image_name, 'category': object_category,
                           'prediction': prediction, 'answer': answer_desc, 'correct': is_correct}
            outfile.write(json.dumps(result_line) + '\n')
            outfile.flush()

print(f'correct rate: {correct / num_total}')
If this work proves useful in your research, please don't forget to cite it!
@article{CompreCap,
  title={Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning},
  author={Fan Lu and Wei Wu and Kecheng Zheng and Shuailei Ma and Biao Gong and Jiawei Liu and Wei Zhai and Yang Cao and Yujun Shen and Zheng-Jun Zha},
  journal={arXiv preprint},
  year={2024}
}
Our CompreCap benchmark is built on the panoptic segmentation annotations of MSCOCO. Thanks for this remarkable contribution!