The State of Embedding-Based Retrieval in E-Commerce
E-commerce search demands a combination of speed, retrieval and ranking quality, and multimodality, which makes it a uniquely demanding technical challenge. Other applications, like medical image search, autocomplete, legal document discovery, and RAG, prioritise various combinations of these properties, but few weight all of them as heavily as e-commerce does.
Users expect search results to be not only relevant but also ranked so that the most desirable products surface at the top, and they expect this to happen with sub-50ms latency, while taking into account both text metadata (like fabric type or brand) and visual features (like style or color).
As we'll see, while recent advances in vision-language models have been impressive, there remains a significant gap between the capabilities of off-the-shelf models and the specific needs of e-commerce applications.
How Embedding Models Work
Traditionally, text search was done using keyword-based methods like BM25, which finds documents by matching words or parts of words in the query with those in the documents. This approach is fast and effective in many cases, but doesn't capture the semantic meaning of the query or documents. For example, searching for "winter clothes" might not surface any results because the product descriptions for coats, jackets, and sweaters don't contain the word "winter".
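To make the failure mode concrete, here is a minimal BM25 sketch over a toy catalogue. The products and scores are purely illustrative, but they show why a query with no term overlap scores zero everywhere:

```python
import math

# Toy product catalogue: keyword search has no notion of "winter" ~ "coat".
docs = {
    "p1": "wool coat with quilted lining",
    "p2": "down jacket waterproof hooded",
    "p3": "summer linen shirt short sleeve",
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: score each doc by exact term overlap with the query."""
    tokenized = {d: text.split() for d, text in docs.items()}
    avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
    N = len(tokenized)
    scores = {}
    for d, terms in tokenized.items():
        s = 0.0
        for q in query.split():
            tf = terms.count(q)
            df = sum(1 for t in tokenized.values() if q in t)
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(terms) / avgdl))
        scores[d] = s
    return scores

print(bm25_scores("winter clothes", docs))  # every score is 0.0: no term overlap
```

Because neither "winter" nor "clothes" appears in any description, every document scores zero, even though all three coats and jackets are plausibly winter clothes.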
Advances in deep learning, like the Transformer architecture, led not only to the LLMs we use every day (like ChatGPT) but also to powerful embedding models. These models map text and images into a shared vector space, enabling semantic search. Embedding models are remarkable in their ability to "understand" image and text inputs and capture the relationships between them, but they are far more complex than traditional keyword-based methods. They are trained on massive datasets, often containing billions of tokens, and there is no easy way to "fix" them when they make mistakes (like ranking products sub-optimally).
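The shared vector space can be sketched with nearest-neighbour search over cosine similarity. The 4-dimensional "embeddings" below are invented for illustration (a real model produces vectors with hundreds or thousands of dimensions), but they show how semantic search succeeds where keyword matching fails:

```python
import math

# Toy stand-ins for a real model's embeddings; the numbers are invented
# for illustration, not produced by any actual embedding model.
emb = {
    "winter clothes": [0.9, 0.1, 0.0, 0.4],
    "wool coat":      [0.8, 0.2, 0.1, 0.5],
    "linen shirt":    [0.1, 0.9, 0.3, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Semantic search: rank items by similarity to the query vector.
# No term overlap is needed between query and product text.
query = emb["winter clothes"]
ranked = sorted((k for k in emb if k != "winter clothes"),
                key=lambda k: cosine(query, emb[k]), reverse=True)
print(ranked)  # "wool coat" ranks above "linen shirt"
```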
The E-Commerce Gap
Despite the increasing number of multimodal embedding models released by AI labs, there is a glaring gap in the market: very few are tailored to the specific demands of e-commerce. The number of fast, open-source (we don't want to train our own model from scratch or use a proprietary one), instruction-tuned (which opens up different use-cases with the same model), early-fusion multimodal embedding models released in the last few years is surprisingly low. Those that do exist are overwhelmingly focused on document-understanding tasks, like parsing charts, reading PDFs, or interpreting diagrams, rather than the "item-in-the-wild" scenarios typical of e-commerce.

This discrepancy between readily available models and industry needs makes building high-quality AI search and recommendation products incredibly difficult. We are often forced to adapt models that were never designed for our specific domain.
The Ranking Problem
A deeper issue lies in how these models are trained. Most use InfoNCE loss as the training objective: the embedding model pushes correct query-item pairs closer together in the embedding space while pushing incorrect pairs apart. This mostly optimises for retrieval quality, i.e., ensuring that the correct items appear somewhere in the top results, but does not inherently optimise the fine-grained ranking of those results.
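A simplified, single-query sketch of InfoNCE makes the limitation visible. Real training computes similarities over whole batches of query and item embeddings; here the similarities are just scalars chosen for illustration:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE for a single query: negative log-softmax of the positive
    pair's similarity against the negative pairs' similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# The loss is low whenever the positive simply beats the negatives; it says
# nothing about how the retrieved items are ordered relative to each other,
# which is exactly the ranking signal e-commerce needs.
print(info_nce(0.9, [0.1, 0.2, 0.0]))  # near 0: positive clearly wins
print(info_nce(0.1, [0.9, 0.8, 0.7]))  # large: a negative outranks the positive
```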
In other words, InfoNCE optimises for retrieval quality, not ranking quality. In most use-cases this is sufficient, since the results are either fed into RAG, where ranking matters less, or a reranker is applied afterwards. The reranker is a pervasive "duct tape" solution adopted by most AI labs: an embedding model retrieves a candidate set (say, 100 products), and then a heavier reranking model reorders them. While effective, this is far from a clean solution, especially for e-commerce:
- Latency Overhead: E-commerce demands sub-50ms latency. Running a second, often larger, model for reranking eats into this budget significantly.
- Infrastructure Complexity: It introduces compute and implementation overhead. You now need to host, scale, and maintain multiple models.
- Debugging Difficulty: It increases the test surface. If search results are poor, is it the embedding model's failure to retrieve the right items, or the reranker's failure to order them?
- Fine-tuning Nightmares: Fine-tuning becomes a complex orchestration. Do you fine-tune both? That requires different pipelines and objective functions, making domain adaptation difficult.
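The two-stage pattern described above can be sketched as follows. The stub index and models are placeholders for a real ANN index, embedding model, and cross-encoder; only the control flow, and where the extra latency and infrastructure live, is the point:

```python
# Hypothetical retrieve-then-rerank pipeline. `embed_query`, `index_search`,
# and `rerank_score` are stand-ins for real components, not actual APIs.

def search(query, embed_query, index_search, rerank_score, k=100, top_n=10):
    # Stage 1: cheap approximate retrieval with the embedding model.
    candidates = index_search(embed_query(query), k)
    # Stage 2: a heavier model rescores every (query, item) pair -- a second
    # model to host, scale, debug, and fine-tune.
    reranked = sorted(candidates, key=lambda item: rerank_score(query, item),
                      reverse=True)
    return reranked[:top_n]

# Toy stand-ins just to exercise the control flow:
catalogue = ["coat", "jacket", "sweater", "sandals"]
embed_query = lambda q: q                    # identity "embedding"
index_search = lambda vec, k: catalogue[:k]  # returns the first k items
rerank_score = lambda q, item: len(item)     # arbitrary scoring rule

print(search("winter clothes", embed_query, index_search, rerank_score, top_n=2))
```

Every failure now has two possible causes: stage 1 missed the item, or stage 2 buried it. That split is what makes the debugging and fine-tuning problems above so painful.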
Qwen3-VL-Embedding
This brings us to the recent release of Qwen3-VL-Embedding by the Qwen team. This model is significant for a few reasons: it's a relatively low-latency, open-source, early-fusion multimodal embedding model released by a major AI lab. It ticks a number of the boxes we established earlier and is a good representative of the current state of the art in open-source embedding models.
However, even this SOTA model was released alongside a reranker, Qwen3-VL-Reranker, an indication that the ranking problem still hasn't been solved at the embedding level: the embedding model itself isn't optimised for ranking quality, so the "duct tape" of a reranker is still needed.
Our Approach
At Solenya, we've taken a different path. We believe that for specific domains like e-commerce, efficiency and ranking quality must be baked into the embedding model itself.
By optimising directly for ranking quality within the embedding space, we eliminate the need for a separate reranker. Our internal evaluations show that our model achieves higher ranking quality (measured by average NDCG@10) than Qwen3-VL-Embedding 2B on in-domain e-commerce tasks, while being significantly faster.
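For readers unfamiliar with the metric, here is one common formulation of NDCG@k (the exponential-gain variant; exact details vary by evaluation toolkit). It rewards placing highly relevant items early, which is precisely what retrieval-only objectives like InfoNCE don't measure:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranking as returned, normalised by the ideal DCG.
    `relevances` are graded labels (e.g. 0-3) in the order the system
    ranked the items."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Same retrieved set, opposite order: a retrieval metric like recall sees no
# difference, but NDCG penalises burying the best item.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0: best items first
print(ndcg_at_k([0, 1, 2, 3]))  # well below 1.0: best item last
```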
Our approach shows that huge models and complex multi-stage pipelines aren't the only way forward. Specialized, well-tuned models that solve the ranking problem at the embedding level can deliver better results, faster, and with less infrastructure.