The State of Embedding-Based Retrieval in E-Commerce
E-commerce search demands a combination of speed, retrieval and ranking quality, and multimodality, which makes it a uniquely demanding technical challenge. Other applications, like medical image search, autocomplete, legal document discovery, and RAG, prioritise various combinations of these properties, but few weight all of them as heavily as e-commerce does.
Users expect search results to be not only relevant but also ranked so that the most desirable products surface at the top, and they expect this to happen with sub-50ms latency, while taking into account both text metadata (like fabric type or brand) and visual features (like style or color).
As we'll see, while recent advances in vision-language models have been impressive, there remains a significant gap between the capabilities of off-the-shelf models and the specific needs of e-commerce applications.
How Embedding Models Work
Traditionally, text search was done using keyword-based methods like BM25, which finds documents by matching words or parts of words in the query with those in the documents. This approach is fast and effective in many cases, but doesn't capture the semantic meaning of the query or documents. For example, searching for "winter clothes" might not surface any results because the product descriptions for coats, jackets, and sweaters don't contain the word "winter".
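To make the failure mode concrete, here is a minimal BM25 sketch over a toy catalogue. The products and scores are purely illustrative, but they show why a query with no term overlap scores zero everywhere:

```python
import math

# Toy product catalogue: keyword search has no notion of "winter" ~ "coat".
docs = {
    "p1": "wool coat with quilted lining",
    "p2": "down jacket waterproof hooded",
    "p3": "summer linen shirt short sleeve",
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: score each doc by exact term overlap with the query."""
    tokenized = {d: text.split() for d, text in docs.items()}
    avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
    N = len(tokenized)
    scores = {}
    for d, terms in tokenized.items():
        s = 0.0
        for q in query.split():
            tf = terms.count(q)
            df = sum(1 for t in tokenized.values() if q in t)
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(terms) / avgdl))
        scores[d] = s
    return scores

print(bm25_scores("winter clothes", docs))  # every score is 0.0: no term overlap
```

Because neither "winter" nor "clothes" appears in any description, every document scores zero, even though all three coats and jackets are plausibly winter clothes.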
Advances in deep learning, like the Transformer architecture, led not only to the LLMs we use every day (like ChatGPT) but also to powerful embedding models. These models map text and images into a shared vector space, enabling semantic search. Embedding models are remarkable in their ability to "understand" image and text inputs and capture the relationships between them, but they are far more complex than traditional keyword-based methods. They are trained on massive datasets, often containing billions of tokens, and there is no easy way to "fix" them when they make mistakes (like ranking products sub-optimally).
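The shared vector space can be sketched with nearest-neighbour search over cosine similarity. The 4-dimensional "embeddings" below are invented for illustration (a real model produces vectors with hundreds or thousands of dimensions), but they show how semantic search succeeds where keyword matching fails:

```python
import math

# Toy stand-ins for a real model's embeddings; the numbers are invented
# for illustration, not produced by any actual embedding model.
emb = {
    "winter clothes": [0.9, 0.1, 0.0, 0.4],
    "wool coat":      [0.8, 0.2, 0.1, 0.5],
    "linen shirt":    [0.1, 0.9, 0.3, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Semantic search: rank items by similarity to the query vector.
# No term overlap is needed between query and product text.
query = emb["winter clothes"]
ranked = sorted((k for k in emb if k != "winter clothes"),
                key=lambda k: cosine(query, emb[k]), reverse=True)
print(ranked)  # "wool coat" ranks above "linen shirt"
```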
The E-Commerce Gap
Despite the increasing number of multimodal embedding models released by AI labs, there is a glaring gap in the market: very few are tailored to the specific demands of e-commerce. The number of fast, open-source (we don't want to train our own model from scratch or use a proprietary one), instruction-tuned (which opens up different use-cases with the same model), early-fusion multimodal embedding models released in the last few years is surprisingly low. Those that do exist are overwhelmingly focused on document-understanding tasks, like parsing charts, reading PDFs, or interpreting diagrams, rather than the "item-in-the-wild" scenarios typical of e-commerce.

This discrepancy between readily available models and industry needs makes building high-quality AI search and recommendation products incredibly difficult. We are often forced to adapt models that were never designed for our specific domain.
The Ranking Problem
A deeper issue lies in how these models are trained. Most use InfoNCE loss as the training objective: the embedding model pushes correct query-item pairs closer together in the embedding space while pushing incorrect pairs apart. This mostly optimises for retrieval quality, i.e., ensuring that the correct items appear somewhere in the top results, but does not inherently optimise the fine-grained ranking of those results.
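A simplified, single-query sketch of InfoNCE makes the limitation visible. Real training computes similarities over whole batches of query and item embeddings; here the similarities are just scalars chosen for illustration:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE for a single query: negative log-softmax of the positive
    pair's similarity against the negative pairs' similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# The loss is low whenever the positive simply beats the negatives; it says
# nothing about how the retrieved items are ordered relative to each other,
# which is exactly the ranking signal e-commerce needs.
print(info_nce(0.9, [0.1, 0.2, 0.0]))  # near 0: positive clearly wins
print(info_nce(0.1, [0.9, 0.8, 0.7]))  # large: a negative outranks the positive
```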
In other words, InfoNCE optimises for retrieval quality, not ranking quality. In most use-cases this is sufficient, since the results are either fed into RAG, where ranking matters less, or a reranker is applied afterwards. The reranker is a pervasive "duct tape" solution adopted by most AI labs: an embedding model retrieves a candidate set (say, 100 products), and then a heavier reranking model reorders them. While effective, this is far from a clean solution, especially for e-commerce:
- Latency Overhead: E-commerce demands sub-50ms latency. Running a second, often larger, model for reranking eats into this budget significantly.
- Infrastructure Complexity: It introduces compute and implementation overhead. You now need to host, scale, and maintain multiple models.
- Debugging Difficulty: It increases the test surface. If search results are poor, is it the embedding model's failure to retrieve the right items, or the reranker's failure to order them?
- Fine-tuning Nightmares: Fine-tuning becomes a complex orchestration. Do you fine-tune both? That requires different pipelines and objective functions, making domain adaptation difficult.
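The two-stage pattern described above can be sketched as follows. The stub index and models are placeholders for a real ANN index, embedding model, and cross-encoder; only the control flow, and where the extra latency and infrastructure live, is the point:

```python
# Hypothetical retrieve-then-rerank pipeline. `embed_query`, `index_search`,
# and `rerank_score` are stand-ins for real components, not actual APIs.

def search(query, embed_query, index_search, rerank_score, k=100, top_n=10):
    # Stage 1: cheap approximate retrieval with the embedding model.
    candidates = index_search(embed_query(query), k)
    # Stage 2: a heavier model rescores every (query, item) pair -- a second
    # model to host, scale, debug, and fine-tune.
    reranked = sorted(candidates, key=lambda item: rerank_score(query, item),
                      reverse=True)
    return reranked[:top_n]

# Toy stand-ins just to exercise the control flow:
catalogue = ["coat", "jacket", "sweater", "sandals"]
embed_query = lambda q: q                    # identity "embedding"
index_search = lambda vec, k: catalogue[:k]  # returns the first k items
rerank_score = lambda q, item: len(item)     # arbitrary scoring rule

print(search("winter clothes", embed_query, index_search, rerank_score, top_n=2))
```

Every failure now has two possible causes: stage 1 missed the item, or stage 2 buried it. That split is what makes the debugging and fine-tuning problems above so painful.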
Qwen3-VL-Embedding
This brings us to the recent release of Qwen3-VL-Embedding by the Qwen team. This model is significant for a few reasons: it's a relatively low-latency, open-source, early-fusion multimodal embedding model released by a major AI lab. It ticks a number of the boxes we established earlier and is a good representative of the current state of the art in open-source embedding models.
However, even this SOTA model was released alongside a reranker, Qwen3-VL-Reranker, an indication that the ranking problem still hasn't been solved at the embedding level: the embedding model itself isn't optimised for ranking quality, so the "duct tape" of a reranker is still needed.
Our Approach
At Solenya, we've taken a different path. We believe that for specific domains like e-commerce, efficiency and ranking quality must be baked into the embedding model itself.
By optimising directly for ranking quality within the embedding space, we eliminate the need for a separate reranker. Our internal evaluations show that our model achieves higher ranking quality (measured by average NDCG@10) than Qwen3-VL-Embedding 2B on in-domain e-commerce tasks, while being significantly faster.
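For readers unfamiliar with the metric, here is one common formulation of NDCG@k (the exponential-gain variant; exact details vary by evaluation toolkit). It rewards placing highly relevant items early, which is precisely what retrieval-only objectives like InfoNCE don't measure:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranking as returned, normalised by the ideal DCG.
    `relevances` are graded labels (e.g. 0-3) in the order the system
    ranked the items."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Same retrieved set, opposite order: a retrieval metric like recall sees no
# difference, but NDCG penalises burying the best item.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0: best items first
print(ndcg_at_k([0, 1, 2, 3]))  # well below 1.0: best item last
```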
Our approach shows that huge models and complex multi-stage pipelines aren't the only way forward. Specialized, well-tuned models that solve the ranking problem at the embedding level can deliver better results, faster, and with less infrastructure.