This repository implements CLIP-based text-to-image and image-to-text retrieval on the DeepFashion-MultiModal dataset. The goal of the project is to build a better understanding of how CLIP works by training a model that embeds images and their corresponding descriptions in a shared semantic space: matching images and descriptions are pulled closer together, while mismatched pairs are pushed apart. This makes it possible to efficiently retrieve images from a text query, or descriptions from an image query. We plan to keep exploring other datasets and fine-tuning the model to improve performance, and we welcome contributions from anyone interested in this area of research.
Built with TensorFlow Similarity.
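
For reference, the sketch below shows the core CLIP-style contrastive objective in plain Keras/TensorFlow. This is a minimal illustration of the idea described above, not the repository's actual training code (which builds on TensorFlow Similarity); the function name, embedding size, and temperature value are illustrative assumptions.

```python
import tensorflow as tf


def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP (illustrative sketch).

    Matched image/text pairs (same row index in the batch) are pulled
    together; all other pairings in the batch are pushed apart.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_embeddings = tf.math.l2_normalize(image_embeddings, axis=-1)
    text_embeddings = tf.math.l2_normalize(text_embeddings, axis=-1)

    # Pairwise similarity logits between every image and every text, scaled by temperature.
    logits = tf.matmul(image_embeddings, text_embeddings, transpose_b=True) / temperature

    # The i-th image matches the i-th description.
    labels = tf.range(tf.shape(logits)[0])

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    loss_t2i = tf.keras.losses.sparse_categorical_crossentropy(labels, tf.transpose(logits), from_logits=True)
    return tf.reduce_mean(loss_i2t + loss_t2i) / 2.0


# Toy usage with random embeddings for a batch of 8 image/text pairs.
img = tf.random.normal((8, 512))
txt = tf.random.normal((8, 512))
print(clip_contrastive_loss(img, txt))
```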
References:
- CLIP: https://openai.com/research/clip
- TensorFlow Similarity multimodal example: https://github.com/tensorflow/similarity/blob/master/examples/multimodal_example.ipynb
- DeepFashion-MultiModal dataset: https://github.com/yumingj/DeepFashion-MultiModal