Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite their impressive zero-shot capabilities, these models assume a pre-defined set of categories, a.k.a. the vocabulary, at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task, as the semantic space is extremely large, containing millions of concepts, including hard-to-discriminate fine-grained categories.
In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best-matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other, more complex vision-language frameworks while being more efficient and using far fewer parameters, paving the way for future research in this direction.
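In short, CaSED (i) retrieves from the external database the captions most similar to the input image, (ii) parses candidate category names out of those captions, and (iii) ranks the candidates with the same vision-language model. The snippet below is only a minimal sketch of this idea under simplifying assumptions, not the official implementation: it uses a CLIP model from transformers, and retrieve_captions and extract_candidates are hypothetical placeholder helpers standing in for the database search and the candidate extraction.

# minimal sketch of the CaSED idea (not the official implementation)
import torch
from transformers import CLIPModel, CLIPProcessor

def cased_classify(image, retrieve_captions, extract_candidates, alpha=0.7):
    # `retrieve_captions(image_emb, k)` and `extract_candidates(captions)` are
    # hypothetical helpers: the first searches an external vision-language
    # database for the k captions closest to the image embedding, the second
    # parses candidate category names (e.g. nouns) out of those captions
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    with torch.no_grad():
        # embed the image once and normalize it
        image_inputs = processor(images=[image], return_tensors="pt")
        image_emb = model.get_image_features(**image_inputs)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

        # 1) retrieve similar captions and extract candidate categories
        captions = retrieve_captions(image_emb, k=10)
        candidates = extract_candidates(captions)

        # 2) embed the candidate categories and the retrieved captions
        cand_inputs = processor(text=candidates, return_tensors="pt", padding=True, truncation=True)
        cand_emb = model.get_text_features(**cand_inputs)
        cand_emb = cand_emb / cand_emb.norm(dim=-1, keepdim=True)

        capt_inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
        capt_emb = model.get_text_features(**capt_inputs)
        capt_emb = (capt_emb / capt_emb.norm(dim=-1, keepdim=True)).mean(dim=0, keepdim=True)

        # 3) blend image-to-candidate and caption-to-candidate similarities;
        #    alpha weighs the visual term against the textual term
        visual = (image_emb @ cand_emb.T).softmax(dim=-1)
        textual = (capt_emb @ cand_emb.T).softmax(dim=-1)
        scores = alpha * visual + (1 - alpha) * textual

    return candidates[scores.argmax().item()]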
Our model CaSED is available on the HuggingFace Hub. You can try it directly from the demo or import it from the transformers library.
To use the model from the HuggingFace Hub, you can use the following snippet:
import requests
from PIL import Image
from transformers import AutoModel, CLIPProcessor
# download an image from the internet
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# load the model and the processor
model = AutoModel.from_pretrained("altndrr/cased", trust_remote_code=True)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# get the model outputs
images = processor(images=[image], return_tensors="pt", padding=True)
outputs = model(images, alpha=0.7)
labels, scores = outputs["vocabularies"][0], outputs["scores"][0]
# print the top 3 most likely labels for the image
values, indices = scores.topk(3)
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{labels[index]:>16s}: {100 * value.item():.2f}%")
Note that our model depends on some libraries you have to install manually. Please refer to the model card for further details.
# clone project
git clone https://github.com/altndrr/vic
cd vic
# install requirements
# it will create a .venv folder in the project root
# and install all the dependencies using flit
make install
# activate virtual environment
source .venv/bin/activate
# copy .env.example to .env
cp .env.example .env
# edit .env file
vim .env
The two entry points are train.py and eval.py. Calling them without any argument will use the default configuration.
# train model
python src/train.py
# test model
python src/eval.py
The full list of parameters can be found under configs, but the most important ones are:
- data: dataset to use, defaults to caltech101.
- experiment: experiment to run, defaults to baseline/clip.
- logger: logger to use, defaults to null.
Parameters can be overwritten by passing them as command line arguments. You can additionally override any parameter from the config file by using the ++ prefix.
# train model on ucf101 dataset
python src/train.py data=ucf101 experiment=baseline/clip
# train model on ucf101 dataset with RN50 backbone
python src/train.py data=ucf101 experiment=baseline/clip model=clip ++model.model_name=RN50
Note that since all our approaches are training-free, there is virtually no difference between train.py and eval.py. However, we still keep them separate for clarity.
# install pre-commit hooks
pre-commit install
# run fast tests
make test
# run all tests
make test-full
# run linters
make format
# remove autogenerated files
make clean
# remove logs
make clean-logs
@inproceedings{conti2023vocabularyfree,
    title={Vocabulary-free Image Classification},
    author={Alessandro Conti and Enrico Fini and Massimiliano Mancini and Paolo Rota and Yiming Wang and Elisa Ricci},
    booktitle={NeurIPS},
    year={2023},
}
We gratefully acknowledge taap studio for designing the logo of this project and ashleve/lightning-hydra-template for the template used to build this repository.