Beyond Understanding: Part 2 of Building a Multimodal AI Visual Search Demo

Building on our previous post, we have started building a system that uses multiple modalities to identify a retail product. The end goal is for a user to supply photos they have taken, speech, and potentially text, and have the system identify the product they are looking for.

In a real-world application, a multimodal AI model sits in the middle of a business process: a user takes pictures and uploads them, or describes the product's dimensions aloud, and a set of associated products is displayed.

In this article, we build the foundation: a data pipeline that ingests the images of a product, stores them, and makes them accessible for similarity search.

Data Pipeline

The goal of the pipeline is to source, store, and secure images and their metadata. We sourced the product images and their associated metadata directly from the retailer's e-commerce website using their public API endpoint. The metadata was organized into Delta tables on Databricks. We created 512-dimensional vector embeddings from the images using the CLIP ViT-B/32 model. These embeddings are the inputs to the Approximate Nearest Neighbor (ANN) process that performs the similarity search.
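To make the similarity step concrete, here is a minimal sketch of what an ANN index approximates: an exact, brute-force ranking of catalog embeddings against a query embedding. It assumes cosine similarity as the metric, which is a common choice for CLIP embeddings but is not confirmed by the pipeline described above. The function names and the toy 3-dimensional vectors are illustrative only.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_matches(
    query: Sequence[float],
    catalog: dict[str, Sequence[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Score every catalog embedding against the query and keep the best k.

    This is an exact O(n) scan. An ANN index trades a little accuracy
    for speed by avoiding the full scan on large catalogs.
    """
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in catalog.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors stand in for 512-dimensional CLIP embeddings.
catalog = {
    "sku-a": [1.0, 0.0, 0.0],
    "sku-b": [0.9, 0.1, 0.0],
    "sku-c": [0.0, 1.0, 0.0],
}
matches = top_k_matches([1.0, 0.0, 0.0], catalog, k=2)
```

With a few hundred images an exact scan like this is already fast. An ANN index matters more as the catalog grows.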

User Experience

Users interact with a Streamlit application that allows them to upload an image. This image is stored in Azure Blob Storage. As the image enters the container, it triggers an API call to generate an embedding on Databricks. The embedding is then passed to Databricks Vector Search to identify the top 5 matches. SAS tokens are used to store and retrieve images from the Blob Storage container.
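The flow above can be sketched as three steps wired together. This is an illustrative outline, not the code from the repo: the function name, the callable parameters, and the fake implementations are all hypothetical. They stand in for the Azure Blob Storage upload, the Databricks embedding job, and the Databricks Vector Search query.

```python
from __future__ import annotations

from typing import Callable, Sequence

Embedding = Sequence[float]
Match = dict  # e.g. {"model_number": ..., "price": ...}; shape is an assumption


def find_similar_products(
    image_bytes: bytes,
    upload: Callable[[bytes], str],
    embed: Callable[[str], Embedding],
    search: Callable[[Embedding, int], list[Match]],
    top_k: int = 5,
) -> list[Match]:
    """Run the upload -> embed -> search flow for one user image."""
    blob_path = upload(image_bytes)   # 1. store the image in the blob container
    embedding = embed(blob_path)      # 2. generate an embedding for the stored image
    return search(embedding, top_k)   # 3. return the top_k nearest catalog items


# Fake steps so the sketch runs without any cloud services.
def fake_upload(data: bytes) -> str:
    return "uploads/example.jpg"


def fake_embed(blob_path: str) -> Embedding:
    return [0.1, 0.2, 0.3]


def fake_search(embedding: Embedding, k: int) -> list[Match]:
    return [{"model_number": f"M-{i}", "price": 10.0 + i} for i in range(k)]


results = find_similar_products(b"raw-image-bytes", fake_upload, fake_embed, fake_search)
```

Passing the three steps in as callables keeps the orchestration testable without credentials. The real implementations can be swapped in at the application boundary.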

Code Artifacts

Here is the link to our repo, where the following components were used to deliver the solution:

config.py: Centralized configuration using environment variables.
azure_storage.py: Handles secure file uploads to Azure Blob Storage with a unique naming convention (YYYYMMDD_UUID.ext).
image_processor.py: Validates and converts the uploaded image to a consistent format.
databricks_client.py: Wraps Databricks API calls to trigger and poll the job until it is complete.
app.py: The Streamlit frontend that orchestrates the entire flow and displays the results.
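Two of the behaviors listed above are easy to illustrate in isolation: the YYYYMMDD_UUID.ext naming convention and the trigger-and-poll pattern. The sketch below is not taken from the repo. The function names, the use of UTC dates, the hyphenated UUID format, and the state names "SUCCESS" and "FAILED" are assumptions made for illustration.

```python
from __future__ import annotations

import time
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def make_blob_name(original_filename: str, now: Optional[datetime] = None) -> str:
    """Build a YYYYMMDD_UUID.ext name, reusing the upload's file extension."""
    now = now or datetime.now(timezone.utc)  # assumption: dates are in UTC
    ext = Path(original_filename).suffix.lower()  # includes the leading dot
    return f"{now:%Y%m%d}_{uuid.uuid4()}{ext}"


def poll_until_done(
    get_state: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_state repeatedly until it reports a terminal state.

    get_state is a hypothetical callable that wraps whatever API call
    returns the job's current status.
    """
    terminal_states = {"SUCCESS", "FAILED"}  # hypothetical state names
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


# Usage with a fixed date and a scripted sequence of job states.
name = make_blob_name("Photo.JPG", now=datetime(2025, 1, 15, tzinfo=timezone.utc))
states = iter(["PENDING", "RUNNING", "SUCCESS"])
final_state = poll_until_done(lambda: next(states), interval_s=0.0, sleep=lambda s: None)
```

The random UUID makes name collisions between uploads very unlikely, and the date prefix keeps blobs sortable by day. The timeout stops the application from waiting indefinitely on a stuck job.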

Results

We collected a dataset consisting of 300 product images. Given an input image of a product, the system accurately identifies the top 5 similar matches in seconds and returns the model number and price of each match.

We use Databricks Vector Search to compare embeddings created by state-of-the-art models: CLIP ViT-B/32 (512 dimensions) and Sentence Transformers (384 to 768 dimensions). Our goal was to begin building the governed data foundation needed to deploy advanced AI against complex, real-world retail challenges.

How does CX Data Labs Help?

We specialize in the protocols and practices for creating vector embeddings and optimizing each modality-specific encoder process. Specifically, we have expertise in the creation of the following:

Vector Embeddings: We specialize in vector embeddings, which are numerical representations of data (like text, images, or audio) in a high-dimensional space, allowing machines to compare meaning or features mathematically.
Cross-Modal Alignment: We construct data pipelines to synchronize and link related data across modalities (e.g., video to audio) using timestamps and identifiers for semantic coherence.
Vector Databases: We persist the outputs in performant, specialized datastores such as Pinecone, Milvus, and Chroma, which index high-dimensional embeddings for fast Approximate Nearest Neighbor (ANN) similarity searches.
Metadata Management: We create context and update the catalog using tools like Azure Purview and AWS Glue Data Catalog, improving the organization and discoverability of unified datasets.

About CX Data Labs

CX Data Labs is a modern data and analytics consulting firm that empowers organizations to transform their data utilization to achieve tangible business outcomes. With deep experience from Fortune 50 environments, we specialize in building and executing data strategies that seamlessly align data, technology, and business goals. Our approach emphasizes hands-on execution, guiding clients from defining a comprehensive data strategy to building scalable data platforms and robust data engineering pipelines, including the specialized infrastructure needed for multimodal AI. Whether it's structuring KPIs, implementing governance, or designing cloud-native architectures, CX Data Labs ensures every decision is driven by measurable business value.

Retail runs on data, but most of it lives in silos — your POS here, your e-commerce there, your supply chain somewhere else. Add in customer reviews, clickstreams, and images, and the picture gets even messier. That chaos slows down every attempt to innovate with AI.

At CX Data Labs, we bring order to the noise. We unify retail’s scattered data into a governed, Azure-native platform on Databricks — clean, connected, and ready for AI. Teams get fast, safe sandboxes to test new ideas, while production pipelines turn the winners into engines for personalization, demand forecasting, and customer experience at scale.
