This repository holds the implementation of a chatbot that lets the user enter a message and returns the best matching image from the given image dataset:
Dataset: jmhessel/newyorker_caption_contest (Hugging Face Datasets)
All code is implemented in Python and tested on a single NVIDIA RTX A5000 machine.
-
Note that a pretrained vision-language model is adopted for the text-to-image retrieval task. Here, we use BLIP [blog]. This repository is built upon the official PyTorch implementation of BLIP [https://github.com/salesforce/BLIP/tree/main]. After forking the BLIP repository, the Python script 'chatbot.py' implementing the ChatBot program was added. The code has been tested on PyTorch 1.10.
-
Once again, note that only the 'chatbot.py' script in this folder is newly implemented.
To run the model, please refer to the Requirements.
To install dependencies for running the application, please execute the following line.
pip install -r requirements.txt
A direct link to the 'newyorker_caption_contest' dataset is here. The dataset is composed of 'train', 'test', and 'validation' sets, containing 2,340, 131, and 130 images, respectively. Note that only the 'train' set is used for parsing images.
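For reference, the 'train' split can be loaded with the Hugging Face datasets library roughly as follows. This is only a minimal sketch; the config name ('explanation') and the 'image' column are assumptions and may need to be adapted to how 'chatbot.py' actually reads the data.

```python
# Minimal sketch of loading the 'train' split with Hugging Face datasets.
# The config name ("explanation") and the "image" column are assumptions.
from datasets import load_dataset

dataset = load_dataset("jmhessel/newyorker_caption_contest", "explanation", split="train")
print(len(dataset))               # expected: 2,340 examples
pil_image = dataset[0]["image"]   # a PIL.Image, assuming an 'image' column
```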
We use the pretrained version of BLIP with a ViT-B backbone, finetuned on the COCO dataset. The checkpoint can be found here.
The entire implementation of the BLIP model is borrowed from the official PyTorch implementation of BLIP. Using BLIP as a feature extractor for both images and text, the ChatBot performs the text-to-image retrieval task, as sketched below.
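As an illustration of this feature-extraction step, here is a minimal sketch using the blip_itm model from the official BLIP repository. The checkpoint path, image size, and tokenizer settings are assumptions and are not necessarily identical to what 'chatbot.py' uses.

```python
# Minimal sketch of BLIP feature extraction (assumes the official repo layout,
# i.e. models/blip_itm.py is importable; checkpoint path and image size are assumptions).
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from models.blip_itm import blip_itm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = blip_itm(pretrained="checkpoints/model_base_retrieval_coco.pth",
                 image_size=384, vit="base").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

@torch.no_grad()
def image_feature(pil_image: Image.Image) -> torch.Tensor:
    """Embed one image into BLIP's shared image-text space (L2-normalized)."""
    image = preprocess(pil_image.convert("RGB")).unsqueeze(0).to(device)
    embeds = model.visual_encoder(image)                              # patch embeddings
    return F.normalize(model.vision_proj(embeds[:, 0, :]), dim=-1)    # [CLS] token -> shared space

@torch.no_grad()
def text_feature(prompt: str) -> torch.Tensor:
    """Embed one text prompt into the same shared space (L2-normalized)."""
    text = model.tokenizer(prompt, padding="max_length", truncation=True,
                           max_length=35, return_tensors="pt").to(device)
    out = model.text_encoder(text.input_ids, attention_mask=text.attention_mask,
                             return_dict=True, mode="text")
    return F.normalize(model.text_proj(out.last_hidden_state[:, 0, :]), dim=-1)
```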
- Note that running BLIP to extract image features requires about 3,500 MiB of GPU memory, and extracting features from each image takes approximately 2 seconds. Considering the cost of the feature extraction process, I prepared preprocessed data in advance, composed of the feature vectors extracted from the whole 'train' set (2,340 images). The preprocessed data is in the folder, under the name 'newyorker_caption_contest.pt'.
However, the user can always decide whether or not to use this preprocessed data. If the user chooses not to, the feature extraction process will run when the program starts (see the sketch below).
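For completeness, precomputing and reloading the feature file could look roughly like this, reusing image_feature and dataset from the sketches above; the key stored inside 'newyorker_caption_contest.pt' is an assumption.

```python
# Sketch: precompute image features for the whole 'train' set and cache them.
# The dictionary key is an assumption about the layout of 'newyorker_caption_contest.pt'.
import torch

image_feats = torch.cat([image_feature(ex["image"]) for ex in dataset])  # (2340, embed_dim)
torch.save({"image_feats": image_feats.cpu()}, "newyorker_caption_contest.pt")

# Later runs can skip extraction and load the cached features instead.
image_feats = torch.load("newyorker_caption_contest.pt", map_location=device)["image_feats"]
```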
To run the application, please run:
python chatbot.py
-
Upon running, the program will ask whether to use the preprocessed dataset or not. If answered 'no', the program will extract features from the dataset itself; if answered 'yes', it will skip the feature extraction step and load the preprocessed features. In either case, the program then asks the user for a prompt.
-
The user will be asked to input a message. Upon receiving a message, the program will promptly find the best matching image in the dataset and display it. At the same time, the retrieved image will be saved to the './chatbot_results' folder, named after the user input.
-
After returning the image, the program will ask the user whether to continue or not. If answered 'yes', the user is asked for a new message; if answered 'no', the program ends. This interaction loop is sketched below.
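Putting the pieces together, the interaction described above can be sketched as follows, reusing text_feature, image_feats, and dataset from the earlier sketches; the prompt wording and the filename scheme are assumptions.

```python
# Sketch of the interactive loop: prompt -> retrieve best image -> save -> continue?
import os

os.makedirs("./chatbot_results", exist_ok=True)

while True:
    message = input("Enter a message: ").strip()
    # Cosine similarity between the text feature and all precomputed image features.
    scores = (text_feature(message).cpu() @ image_feats.cpu().T).squeeze(0)
    best = scores.argmax().item()
    image = dataset[best]["image"]
    image.show()
    image.save(os.path.join("./chatbot_results", f"{message}.png"))
    if input("Continue? (yes/no): ").strip().lower() != "yes":
        break
```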
The running screen would look like this:
Some examples of images retrieved from user prompts (the prompt is written at the top of each image):