Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
Hulingxiao He · Geng Li · Zijun Geng · Jinglin Xu · Yuxin Peng
OpenReview | Paper | Model
TL;DR: We revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause of weak FGVR performance as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances FGVR capability by incorporating informative attribute descriptions of objects into the training phase.

- [02/12/2025] We release the model Finedefics and evaluation code.
- [01/23/2025] Our work has been accepted to ICLR 2025 🌼! Code is coming soon. See you in Singapore this April!
We use FOCI-Benchmark to evaluate our model.
Before starting, download and prepare the evaluation datasets you want to use by following the guide here.
Requirements can be found in requirements.txt. We recommend using Python ≥ 3.9 and PyTorch ≥ 2.2.1.
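For example, the environment can be set up with the standard pip workflow (assuming a fresh virtual environment):

```bash
pip install -r requirements.txt
```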
An example of evaluating on the dog-120 dataset:

```bash
python run_ic_bench.py --model=/path/to/model --dataset=dog-120 --prompt_query='Which of these dogs is shown in the image?' --image_root=/path/to/dog-120 --batchsize=4
```
Note: Available datasets are dog-120, bird-200, fgvc_aircraft, flowers102, oxford_pet, stanford_cars, imagenet-rendition, imagenet-adversarial, and imagenet-sketch.
See scripts for examples of evaluating Finedefics on all benchmark datasets.
Our code is trivial to extend to new models, especially if they use HuggingFace:
- Implement the model based on the reference HfModel or one of the other implemented models (a minimal sketch follows this list).
- Update model_template() to provide the model instruction template.
- Update load_model() to load the model based on the name.
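For illustration, a new model wrapper might look like the sketch below. It assumes the benchmark only needs batched generation from (image, prompt) pairs; the class name MyNewModel is hypothetical, and the actual HfModel interface in this repo may differ.

```python
# Hypothetical model wrapper; in practice, subclass the repo's reference
# HfModel. Only standard Hugging Face APIs are used here.
from transformers import AutoModelForVision2Seq, AutoProcessor


class MyNewModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        self.processor = AutoProcessor.from_pretrained(model_path)
        self.model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
        self.device = device

    def generate(self, images, prompts, max_new_tokens: int = 32):
        # Encode a batch of (image, prompt) pairs and decode the answers.
        inputs = self.processor(
            images=images, text=prompts, return_tensors="pt", padding=True
        ).to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)
```

The wrapper is then wired in by adding its instruction template to model_template() and a branch in load_model() that returns it for the matching model name.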
Our code is also trivial to extend to new image classification datasets:
- Implement a loader function that creates a dictionary mapping labels to (relative) image paths and add it to DATASET_TO_LOADER (a minimal sketch follows this list).
- When running the benchmark for the first time, we use CLIP to find difficult multiple-choice options and store them in data for subsequent runs.
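A possible loader for a dataset stored in a label-per-folder layout is sketched below. The name load_my_dataset and the folder structure are assumptions for illustration; only the returned label-to-relative-paths mapping comes from the description above.

```python
import os


# Hypothetical loader for a dataset laid out as <image_root>/<label>/<image>.
def load_my_dataset(image_root: str) -> dict[str, list[str]]:
    label_to_paths: dict[str, list[str]] = {}
    for label in sorted(os.listdir(image_root)):
        class_dir = os.path.join(image_root, label)
        if not os.path.isdir(class_dir):
            continue
        label_to_paths[label] = [
            os.path.join(label, fname)  # keep paths relative to image_root
            for fname in sorted(os.listdir(class_dir))
            if fname.lower().endswith((".jpg", ".jpeg", ".png"))
        ]
    return label_to_paths


# Register the loader under a new dataset name:
# DATASET_TO_LOADER["my-dataset"] = load_my_dataset
```

For intuition, the CLIP-based option mining mentioned above amounts to ranking labels by embedding similarity. The sketch below is not the repo's actual code (which also caches the results in data); it merely illustrates the idea using text embeddings.

```python
# Rough sketch: for each label, pick the k most similar other labels as
# difficult multiple-choice distractors.
import torch
from transformers import CLIPModel, CLIPProcessor


def mine_hard_options(labels: list[str], k: int = 3) -> dict[str, list[str]]:
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sim = emb @ emb.T                           # pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                    # exclude the label itself
    topk = sim.topk(k, dim=-1).indices
    return {
        labels[i]: [labels[j] for j in topk[i].tolist()]
        for i in range(len(labels))
    }
```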
Our code builds on FineR, FOCI-Benchmark, and HACL. Many thanks to the authors.
If you find our paper valuable to your work, please consider citing it:
```bibtex
@inproceedings{he2025analyzing,
  title     = {Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models},
  author    = {Hulingxiao He and Geng Li and Zijun Geng and Jinglin Xu and Yuxin Peng},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=p3NKpom1VL}
}
```