PyTorch code for Chatbot using pretrained BLIP

Repository for the ChatBot Design Application exercise.

This repository holds the implementation of a chatbot that lets a user enter a message and returns the best-matching image from the given image dataset:

Dataset: jmhessel/newyorker_caption_contest · Datasets at Hugging Face

All code is implemented in Python and tested on a single NVIDIA RTX A5000 machine.

  • Note that a pretrained vision-language model is adopted for the text-to-image retrieval task. Here, we use BLIP [blog]. This repository is built upon the official PyTorch implementation of BLIP [https://github.com/salesforce/BLIP/tree/main]. After forking the BLIP repository, the Python script 'chatbot.py' for the ChatBot program was implemented. The code has been tested with PyTorch 1.10.

  • Once again, note that only the 'chatbot.py' script in this repository is newly implemented.

To run the model, please refer to the Requirements section below.

Requirements

To install the dependencies for running the application, please execute the following command.

pip install -r requirements.txt

Dataset

A direct link to the 'newyorker_caption_contest' dataset is here. The dataset is composed of 'train', 'test', and 'validation' sets, containing 2,340, 131, and 130 images, respectively. Note that only the 'train' set is used for parsing images.
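For reference, the splits can be inspected by loading the dataset with the Hugging Face datasets library. This is a minimal sketch, not part of this repository; the "explanation" configuration name and the "image" column follow the dataset card and are assumptions here.

```python
# Minimal sketch (not part of this repository): load the dataset with the
# Hugging Face 'datasets' library. The "explanation" configuration and the
# "image" column name follow the dataset card and are assumptions here.
from datasets import load_dataset

ds = load_dataset("jmhessel/newyorker_caption_contest", "explanation")
print({split: ds[split].num_rows for split in ds})   # per the README: train=2340, test=131, validation=130
sample_image = ds["train"][0]["image"]               # a PIL image from the retrieval pool
```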

Model

We use the pretrained version of BLIP with a ViT-B backbone, finetuned on the COCO dataset. The checkpoint can be found here.
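As a rough illustration (not taken from 'chatbot.py'), the checkpoint can be loaded with the feature-extractor helper from the upstream BLIP repository; the helper name, arguments, and checkpoint filename below follow the official BLIP demo and are assumptions here.

```python
# Minimal sketch, following the upstream BLIP demo rather than chatbot.py:
# load the COCO-finetuned ViT-B checkpoint as a feature extractor.
import torch
from models.blip import blip_feature_extractor  # module from the forked BLIP repository

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = "model_base_retrieval_coco.pth"   # placeholder: path to the downloaded checkpoint

model = blip_feature_extractor(pretrained=checkpoint_path, image_size=384, vit="base")
model.eval()
model = model.to(device)
```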

Code references and guidelines to new codes

The entire implementation of the BLIP model is borrowed from the official PyTorch implementation of BLIP. Using BLIP as a feature extractor for both images and text, the ChatBot performs the text-to-image retrieval task.

  • Note that running BLIP to extract image features requires about 3,500 MiB of GPU memory, and extraction takes approximately 2 seconds per image. Given the cost of this feature extraction process, I prepared preprocessed data in advance, consisting of the feature vectors extracted from the whole 'train' set (2,340 images). The preprocessed data is included in the repository as 'newyorker_caption_contest.pt'.

However, the user can always decide whether to use this preprocessed data. If the user chooses not to, the feature extraction process runs when the program starts.
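To give a sense of how the precomputed features can be used, here is a minimal retrieval sketch. It continues from the model loaded in the sketch above; the layout of 'newyorker_caption_contest.pt' (a single tensor of image features in 'train' order) is an assumption, not read from 'chatbot.py'.

```python
# Minimal retrieval sketch. Assumptions: 'newyorker_caption_contest.pt' holds a
# (2340, D) tensor of image features in 'train' order; `model` is the BLIP
# feature extractor loaded in the sketch above.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
image_feats = F.normalize(torch.load("newyorker_caption_contest.pt", map_location=device), dim=-1)

prompt = "a dog giving a presentation"                    # example user message
with torch.no_grad():
    text_out = model(None, prompt, mode="text")           # text branch; the image input is unused in this mode
    text_feat = F.normalize(text_out[:, 0, :], dim=-1)    # CLS embedding, shape (1, D)

best_idx = (image_feats @ text_feat.T).squeeze(1).argmax().item()
print("Best-matching 'train' image index:", best_idx)
```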

Testing

To run the application, please run:

python chatbot.py
  1. Upon running, the program asks whether to use the preprocessed dataset. If answered 'no', the program first extracts features from the 'train' set; if answered 'yes', it skips the feature extraction step, loads the precomputed features, and asks the user for a prompt.

  2. The user is then asked to input a message. Upon receiving the message, the program promptly finds the best-matching image in the dataset and displays it. At the same time, the retrieved image is saved in the './chatbot_results' folder, named after the user input.

  3. After returning the image, the program asks the user whether to continue. If answered 'yes', the user is asked for a new message; if answered 'no', the program ends.
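Putting these steps together, the interaction loop looks roughly like the sketch below; retrieve_best_image() is a hypothetical stand-in for the retrieval logic in 'chatbot.py'.

```python
# Rough sketch of the interaction loop described above.
# retrieve_best_image() is a hypothetical stand-in for the logic in chatbot.py.
import os

os.makedirs("./chatbot_results", exist_ok=True)

while True:
    message = input("Enter a message: ")
    image = retrieve_best_image(message)                  # hypothetical helper returning a PIL image
    image.show()                                          # display the best-matching image
    image.save(os.path.join("./chatbot_results", f"{message}.png"))  # saved under the user input
    if input("Continue? (yes/no): ").strip().lower() != "yes":
        break
```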

The running screen would look like this:

[Screenshot of the running program]

Some examples of the images retrieved from user prompts (prompt written at the top of each image):

[Three example retrieved images]
