Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models

Hulingxiao He · Geng Li · Zijun Geng · Jinglin Xu · Yuxin Peng

ICLR 2025

TL;DR: We revisit three quintessential capabilities of MLLMs for FGVR, i.e., object information extraction, category knowledge reserve, and object-category alignment, and position the root cause of FGVR failures as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances FGVR capability by incorporating informative attribute descriptions of objects into the training phase.


📣 News

  • [02/12/2025] We release the model Finedefics and evaluation code.
  • [01/23/2025] Our work is accepted to ICLR 2025 🌼! Code is coming soon. See you in Singapore this April!

📋 Evaluation

We use FOCI-Benchmark to evaluate our model.

1. Preparing the Data

Before starting, download and prepare the evaluation datasets you want to use by following the dataset preparation guide.

2. Preparing the Environment

Requirements can be found in requirements.txt. We recommend using Python ≥ 3.9 and PyTorch ≥ 2.2.1.

3. Running the Benchmark

An example of evaluating on the dog-120 dataset:

python run_ic_bench.py --model=/path/to/model --dataset=dog-120 --prompt_query='Which of these dogs is shown in the image?' --image_root=/path/to/dog-120 --batchsize=4

Note: Available datasets are dog-120, bird-200, fgvc_aircraft, flowers102, oxford_pet, stanford_cars, imagenet-rendition, imagenet-adversarial, imagenet-sketch.

See scripts for examples of evaluating Finedefics on all benchmark datasets.

4. Testing New Models

Our code is straightforward to extend to new models, especially those built on HuggingFace; a short sketch follows the steps below:

  • Implement the model based on the reference HfModel or one of the other implemented models.
  • Update model_template() to provide the model instruction template.
  • Update load_model() to load the model based on the name.
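
A minimal sketch of what these steps might look like, assuming a HuggingFace-style model. Only HfModel, model_template(), and load_model() are named by the benchmark; the class MyModel, the dispatch on the model name, and all method signatures below are illustrative assumptions and need to be adapted to the actual interfaces in FOCI-Benchmark.

```python
# Hypothetical sketch: only HfModel, model_template(), and load_model() are
# named by the benchmark; everything else here is illustrative.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq


class MyModel:  # modeled after the benchmark's reference HfModel
    def __init__(self, model_path: str):
        # Load the processor and weights through HuggingFace (assumed pattern).
        self.processor = AutoProcessor.from_pretrained(model_path)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def generate(self, images, prompts):
        # Answer a batch of multiple-choice prompts about the given images.
        inputs = self.processor(
            images=images, text=prompts, return_tensors="pt", padding=True
        ).to(self.model.device)
        out = self.model.generate(**inputs, max_new_tokens=16, do_sample=False)
        return self.processor.batch_decode(out, skip_special_tokens=True)


def model_template(model_name: str) -> str:
    # Return the instruction template used to wrap the benchmark prompt.
    if "my-model" in model_name:
        return "USER: <image>\n{prompt}\nASSISTANT:"
    raise ValueError(f"Unknown model: {model_name}")


def load_model(model_name: str):
    # Instantiate the model from the name/path passed via --model.
    if "my-model" in model_name:
        return MyModel(model_name)
    raise ValueError(f"Unknown model: {model_name}")
```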

5. Testing on New Datasets

Our code is also straightforward to extend to new image classification datasets; a loader sketch follows the steps below:

  • Implement a loader function that creates a dictionary mapping labels to (relative) image paths and add it to DATASET_TO_LOADER.
  • When running the benchmark on a dataset for the first time, we use CLIP to find difficult multiple-choice options and store them under data for subsequent runs.
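
For instance, a loader for a dataset laid out with one sub-folder per class might look like the sketch below. Only DATASET_TO_LOADER is named by the benchmark; the dataset name, the directory layout, and the label-to-list-of-paths value type are assumptions to adapt.

```python
# Hypothetical loader sketch: only DATASET_TO_LOADER comes from the benchmark;
# the one-sub-folder-per-class layout and the list-of-paths values are assumptions.
import os
from collections import defaultdict


def load_my_dataset(image_root: str) -> dict:
    """Map each class label to image paths relative to image_root."""
    label_to_paths = defaultdict(list)
    for class_name in sorted(os.listdir(image_root)):
        class_dir = os.path.join(image_root, class_name)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith((".jpg", ".jpeg", ".png")):
                # Store paths relative to image_root.
                label_to_paths[class_name].append(os.path.join(class_name, fname))
    return dict(label_to_paths)


# Register the loader under the name passed to --dataset.
DATASET_TO_LOADER["my-dataset"] = load_my_dataset
```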

🚩 Acknowledgments

Our code references FineR, FOCI-Benchmark, and HACL. Many thanks to the authors.

🗻 Citation

Should you find our paper valuable to your work, we would greatly appreciate it if you could cite it:

@inproceedings{he2025analyzing,
    title={Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models},
    author={Hulingxiao He and Geng Li and Zijun Geng and Jinglin Xu and Yuxin Peng},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=p3NKpom1VL}
}
