The goal of this project is to develop a robust product search engine that leverages multimodal data (text and images) to enhance the understanding of user preferences and deliver highly relevant and personalized product recommendations. Traditional search systems often fall short in capturing the combined semantic impact of visual and textual features, which can lead to suboptimal search results. Our solution addresses these limitations by integrating and processing multimodal data for more accurate and context-aware recommendations.
Existing product search engines primarily rely on either textual or visual features, often neglecting the interplay between these modalities. This results in limited accuracy when recommending products that require an understanding of both textual descriptions and visual elements. The challenge is to create a search engine that effectively combines these features to improve the relevance of product search results while minimizing latency.
Our approach builds on advancements in multimodal machine learning, particularly the use of models like CLIP (Contrastive Language–Image Pretraining), and employs vector databases such as Pinecone for efficient similarity search.
The project uses a subset of the McAuley Amazon dataset, which contains product metadata and user reviews across 35 categories. For this project, we focus on three categories (pet supplies, office supplies, health and household) and utilize only the product metadata, which includes titles, images, descriptions, and other product-related features. The selected subset provides a rich dataset for understanding the interplay of textual and visual features in product searches.
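As a rough sketch of how this metadata subset could be loaded, the snippet below uses the Hugging Face `datasets` library; the dataset identifier and configuration names (`McAuley-Lab/Amazon-Reviews-2023`, `raw_meta_*`) and the `full` split are assumptions and may differ from the exact snapshot used in this project.

```python
# Sketch: load product metadata for the three selected categories.
# Dataset id, config names, and split are assumptions, not the project's exact setup.
from datasets import load_dataset

CATEGORIES = ["Pet_Supplies", "Office_Products", "Health_and_Household"]

def load_metadata(category: str):
    """Load the raw product-metadata configuration for one Amazon category."""
    return load_dataset(
        "McAuley-Lab/Amazon-Reviews-2023",   # assumed dataset id
        f"raw_meta_{category}",              # assumed config name
        split="full",
        trust_remote_code=True,
    )

metadata = {cat: load_metadata(cat) for cat in CATEGORIES}
for cat, ds in metadata.items():
    print(cat, len(ds), ds.column_names[:5])  # titles, images, descriptions, ...
```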
To ensure data relevance, products without images, those with very short titles (two words or fewer), and those with mismatched titles and images were removed. CLIP was used to identify mismatches by comparing text and image embeddings, with products scoring below a cosine similarity of 0.2 excluded.
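A minimal sketch of this cleaning step is shown below, assuming the Hugging Face `transformers` CLIP implementation and the `openai/clip-vit-base-patch32` checkpoint (the exact CLIP variant is an assumption); the 0.2 threshold and the title-length rule follow the description above, while the helper names are illustrative.

```python
# Sketch: drop products whose title and image disagree under CLIP.
# The CLIP checkpoint and helper names are assumptions; the rules follow the text above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"   # assumed CLIP variant
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def title_image_similarity(title: str, image: Image.Image) -> float:
    """Cosine similarity between the CLIP text and image embeddings."""
    with torch.no_grad():
        text_inputs = processor(text=[title], return_tensors="pt",
                                padding=True, truncation=True)
        image_inputs = processor(images=image, return_tensors="pt")
        text_emb = model.get_text_features(**text_inputs)
        image_emb = model.get_image_features(**image_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return float((text_emb @ image_emb.T).item())

def keep_product(title: str, image: Image.Image | None) -> bool:
    """Cleaning rules: image present, title longer than two words, CLIP match >= 0.2."""
    if image is None or len(title.split()) <= 2:
        return False
    return title_image_similarity(title, image) >= 0.2
```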
Our approach begins with multimodal embedding creation, where CLIP generates embeddings for product titles and their corresponding images. This compute-intensive task ran on Google Cloud Platform's Vertex AI with an NVIDIA L4 GPU, 16 vCPUs, and 64 GB of RAM, with Dask used for parallel processing. The embeddings are stored in Pinecone, a vector database, with two separate indexes created per category: one containing only image embeddings and the other containing a linear combination of the text and image embeddings.
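The sketch below illustrates how the two indexes per category might be populated. It assumes the current Pinecone serverless client; the index naming scheme, the 512-dimensional embedding size of `clip-vit-base-patch32`, and the equal text/image weight in the linear combination are assumptions rather than the project's exact configuration.

```python
# Sketch: one image-only index and one combined text+image index per category.
# Index names, the combination weight, and the 512-dim embedding size are assumptions.
import numpy as np
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
DIM = 512      # embedding size of clip-vit-base-patch32 (assumed variant)
ALPHA = 0.5    # text weight in the linear combination (assumed value)

def ensure_index(name: str):
    """Create a cosine-similarity index if it does not exist, then return a handle."""
    if name not in pc.list_indexes().names():
        pc.create_index(name=name, dimension=DIM, metric="cosine",
                        spec=ServerlessSpec(cloud="aws", region="us-east-1"))
    return pc.Index(name)

def upsert_product(category: str, pid: str, text_emb: np.ndarray, img_emb: np.ndarray):
    """Store the image embedding and the text+image combination for one product."""
    img_index = ensure_index(f"{category}-image")
    combo_index = ensure_index(f"{category}-text-image")
    combined = ALPHA * text_emb + (1 - ALPHA) * img_emb
    combined = combined / np.linalg.norm(combined)
    img_index.upsert(vectors=[(pid, (img_emb / np.linalg.norm(img_emb)).tolist())])
    combo_index.upsert(vectors=[(pid, combined.tolist())])
```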
For query processing, a user query first undergoes zero-shot classification to identify its category, narrowing the search space and reducing latency. The query is then embedded with the same CLIP tokenizer and text encoder used to produce the product metadata embeddings. During the similarity search phase, the embedded query retrieves the top 10 most similar products from each of the two indexes using cosine similarity.
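A hedged sketch of this query path follows, reusing `model`, `processor`, and `pc` from the earlier sketches. The zero-shot classification model (`facebook/bart-large-mnli`), the candidate label wording, and the index naming are assumptions; the project may have used a different classifier.

```python
# Sketch: route the query to a category, embed it with CLIP, and fetch top-10 matches
# from each index. The zero-shot model and index names are assumptions.
import torch
from transformers import pipeline

CATEGORIES = ["pet supplies", "office supplies", "health and household"]
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def embed_query(query: str) -> list[float]:
    """Embed the query with the same CLIP text encoder used for product metadata."""
    with torch.no_grad():
        inputs = processor(text=[query], return_tensors="pt",
                           padding=True, truncation=True)
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0].tolist()

def search(query: str, top_k: int = 10):
    """Classify the query's category, then query both per-category indexes."""
    category = classifier(query, candidate_labels=CATEGORIES)["labels"][0]
    slug = category.replace(" ", "-")
    vec = embed_query(query)
    image_hits = pc.Index(f"{slug}-image").query(vector=vec, top_k=top_k)
    combo_hits = pc.Index(f"{slug}-text-image").query(vector=vec, top_k=top_k)
    return category, image_hits, combo_hits
```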
For re-ranking, a second zero-shot classification determines whether the query contains abstract or visualizable terms. If the query is abstract and not easily visualized, a weighted combination of results is applied, assigning 75% weight to cosine similarity scores from the image-only vector database and 25% weight to scores from the text+image vector database. Finally, the results are re-ranked based on the weighted cosine similarity, and the top 10 products are selected for presentation.
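The weighted merge could look like the sketch below, reusing the zero-shot classifier from the previous sketch. The abstract/visualizable candidate labels are assumptions, and because the section does not specify how non-abstract queries are weighted, the equal-weight fallback is a placeholder; only the 0.75/0.25 split for abstract queries comes from the description above.

```python
# Sketch: weighted re-ranking of the two result lists.
# Labels and the non-abstract fallback weights are assumptions; 0.75/0.25 follows the text.
def rerank(query: str, image_hits, combo_hits, top_k: int = 10):
    verdict = classifier(query, candidate_labels=["abstract concept", "visualizable object"])
    is_abstract = verdict["labels"][0] == "abstract concept"
    # Non-abstract weighting is not specified in this section; equal weights assumed.
    w_image, w_combo = (0.75, 0.25) if is_abstract else (0.5, 0.5)

    scores: dict[str, float] = {}
    for match in image_hits["matches"]:
        scores[match["id"]] = scores.get(match["id"], 0.0) + w_image * match["score"]
    for match in combo_hits["matches"]:
        scores[match["id"]] = scores.get(match["id"], 0.0) + w_combo * match["score"]

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```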