Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
Hulingxiao He · Geng Li · Zijun Geng · Jinglin Xu · Yuxin Peng
OpenReview | Paper | Model
TL;DR: We revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause of weak FGVR performance as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances FGVR capability by incorporating informative attribute descriptions of objects into the training phase.

- [02/12/2025] We release the model Finedefics and evaluation code.
- [01/23/2025] Our work has been accepted to ICLR 2025 🌼! Code is coming soon. See you in Singapore this April!
We use FOCI-Benchmark to evaluate our model.
Before starting, download and prepare the evaluation datasets you want to use by following the guide here.
Requirements can be found in requirements.txt. We recommend using Python ≥ 3.9 and PyTorch ≥ 2.2.1.
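For example, the environment can be set up with the standard pip workflow (assuming a fresh virtual environment):

```bash
pip install -r requirements.txt
```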
An example of evaluating on the dog-120 dataset:

```bash
python run_ic_bench.py --model=/path/to/model --dataset=dog-120 --prompt_query='Which of these dogs is shown in the image?' --image_root=/path/to/dog-120 --batchsize=4
```
Note: Available datasets are dog-120, bird-200, fgvc_aircraft, flowers102, oxford_pet, stanford_cars, imagenet-rendition, imagenet-adversarial, and imagenet-sketch.
See scripts for examples of evaluating Finedefics on all benchmark datasets.
Our code is trivial to extend to new models, especially if they use HuggingFace:
- Implement the model based on the reference HfModel or one of the other implemented models (a minimal sketch follows this list).
- Update model_template() to provide the model instruction template.
- Update load_model() to load the model based on the name.
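For illustration, a new model wrapper might look like the sketch below. It assumes the benchmark only needs batched generation from (image, prompt) pairs; the class name MyNewModel is hypothetical, and the actual HfModel interface in this repo may differ.

```python
# Hypothetical model wrapper; in practice, subclass the repo's reference
# HfModel. Only standard Hugging Face APIs are used here.
from transformers import AutoModelForVision2Seq, AutoProcessor


class MyNewModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.processor = AutoProcessor.from_pretrained(model_path)
        self.model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
        self.device = device

    def generate(self, images, prompts, max_new_tokens: int = 32):
        # Encode a batch of (image, prompt) pairs and decode the answers.
        inputs = self.processor(
            images=images, text=prompts, return_tensors="pt", padding=True
        ).to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)
```

The wrapper is then wired in by adding its instruction template to model_template() and a branch in load_model() that returns it for the matching model name.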
Our code is also trivial to extend to new image classification datasets:
- Implement a loader function that creates a dictionary mapping labels to (relative) image paths and add it to DATASET_TO_LOADER (a minimal sketch follows this list).
- When running the benchmark for the first time, we use CLIP to find difficult multiple-choice options and store them in data for subsequent runs.
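A possible loader for a dataset stored in a label-per-folder layout is sketched below. The name load_my_dataset and the folder structure are assumptions for illustration; only the returned label-to-relative-paths mapping comes from the description above.

```python
import os


# Hypothetical loader for a dataset laid out as <image_root>/<label>/<image>.
def load_my_dataset(image_root: str) -> dict[str, list[str]]:
    label_to_paths: dict[str, list[str]] = {}
    for label in sorted(os.listdir(image_root)):
        class_dir = os.path.join(image_root, label)
        if not os.path.isdir(class_dir):
            continue
        label_to_paths[label] = [
            os.path.join(label, fname)  # keep paths relative to image_root
            for fname in sorted(os.listdir(class_dir))
            if fname.lower().endswith((".jpg", ".jpeg", ".png"))
        ]
    return label_to_paths


# Register the loader under a new dataset name:
# DATASET_TO_LOADER["my-dataset"] = load_my_dataset
```

For intuition, the CLIP-based option mining mentioned above amounts to ranking labels by embedding similarity. The sketch below is not the repo's actual code (which also caches the results in data); it merely illustrates the idea using text embeddings.

```python
# Rough sketch: for each label, pick the k most similar other labels as
# difficult multiple-choice distractors.
import torch
from transformers import CLIPModel, CLIPProcessor


def mine_hard_options(labels: list[str], k: int = 3) -> dict[str, list[str]]:
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sim = emb @ emb.T                           # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                    # exclude the label itself
    topk = sim.topk(k, dim=-1).indices
    return {
        labels[i]: [labels[j] for j in topk[i].tolist()]
        for i in range(len(labels))
    }
```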
Our code builds on FineR, FOCI-Benchmark, and HACL. Many thanks to the authors.
If you find our paper valuable to your work, please consider citing it:
```bibtex
@inproceedings{he2025analyzing,
  title     = {Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models},
  author    = {Hulingxiao He and Geng Li and Zijun Geng and Jinglin Xu and Yuxin Peng},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=p3NKpom1VL}
}
```