Kimchi 1: Product Search Enters the Post-Training Era

Modern LLMs are produced by a three-stage pipeline: large-scale pretraining, supervised fine-tuning (SFT) on instruction-formatted data, and a post-training alignment stage driven by reinforcement learning from human feedback (RLHF) ¹ or reinforcement learning from AI feedback (RLAIF) ². The post-training stage, popularised by InstructGPT and ChatGPT in 2022, is now treated as load-bearing rather than optional: it is where deployment-shaped objectives (helpfulness, harmlessness, instruction-following) are imposed on models that pretraining alone leaves under-specified.

Information retrieval has its own staged pipeline—large-scale contrastive pretraining ³, in-batch negative mining, supervised fine-tuning ^{4, 5}, distillation—and until recently it lacked an analogous post-training stage. That gap has only begun to close: policy-gradient ranking via Plackett–Luce models ⁶, GRPO over dense retrieval embeddings in production e-commerce ^{7, 8}, utility-aligned embedding distillation ⁹, and preference-aligned cross-modal representations ¹⁰ have all demonstrated that RL can improve retrieval beyond what contrastive surrogates achieve. But a consistent pattern persists across these approaches: the reward signal is either derived from logged user interactions (clicks, purchases, session transactions), distilled from a co-trained or black-box LLM, or tied to a specific downstream generative task. None provides a frozen, auditable, rubric-graded reward channel that can be deployed on day zero for a new merchant with no historical traffic.

This is not a product announcement—read Introducing Kimchi 1 for the benchmarks, NDCG@10 scores, and GA details. This post is a technical dissection of the machinery underneath and the research programme it belongs to. We argue that the open question is no longer whether to post-train retrieval, but what reward signal to post-train against—and that the construction below, Plackett–Luce policy gradients under a frozen, rubric-graded RLAIF judge, is a design point with specific properties the alternatives lack.

1. The Objective Mismatch Between Training and Deployment

Retrieval and recommendation systems are deployed as rankers. At serve time the model scores a candidate pool and a top- $k$ subset is returned. But most production training loops still optimise contrastive ³, pointwise, or supervised listwise losses ^{4, 5} on logged or mined candidate sets. These signals only partially expose the combinatorial structure of the served list, the positional discount with which users (or downstream agents) actually consume it, and the off-policy gap between the deployed scorer and the data the next gradient step is computed from.

A model trained on an in-batch contrastive surrogate may still rank suboptimally at serve time because the surrogate does not score the list-level decision that will be deployed. The gap between "this item is relevant" and "this ordered list is optimal" has a specific structure:

Positional discount. Users and shopping agents consume top-down. An item at position 1 carries far more weight than the same item at position 10 (about 3.5 times as much under the logarithmic discount $1 / \log_2(i + 1)$ ).
Asymmetric specificity. A generic "running shoe" is acceptable evidence for the broad query running shoes, but should be penalised under the narrow query trail running shoes. We formalised this property in our evaluation rubric—see A New Age in Evaluation: The LLM Judge—and it is the single largest source of disagreement between naive embedding similarity and human relevance judgements.
Combinatorial dependence. The utility of an item depends on what else is in the list. Redundancy, category coverage, and complementarity are list-level properties invisible to any pointwise or pairwise surrogate.

This mismatch is structurally identical to the gap between cross-entropy next-token prediction and response-level helpfulness that RLHF was invented to close ¹. We cannot cross it with contrastive losses. We have to optimise the list directly.

2. A Maturing Field

Post-training for retrieval is not a hypothesis. Multiple research streams have converged on it simultaneously, and that convergence is itself the signal.

PL-Rank ¹¹ (Oosterhuis, SIGIR 2021) established the computational scaffolding. By exploiting the specific structure of Plackett–Luce models rather than relying on generic policy gradients, Oosterhuis achieved greater sample efficiency and lower computational cost—the vectorised bedrock that makes this approach practical at catalogue scale.

Neural PG-RANK ⁶ (Gao et al., 2023) demonstrated the scaffold we build on: an LM as a Plackett–Luce policy, trained end-to-end via REINFORCE with leave-one-out baselines against downstream utility (nDCG, BLEU). Their work established that policy-gradient ranking over PL models is tractable and competitive.

HARR ¹² (Zhang et al., 2026) verified the empirical viability of RL for dense retrievers. By replacing deterministic top- $k$ with stochastic PL sampling and applying GRPO with history-aware state representations, they showed that a lightweight retriever could be post-trained in approximately 3 hours on a single GPU node—orders of magnitude cheaper than LLM fine-tuning. Their results demonstrated consistent improvements across datasets, RAG pipelines, and retriever scales.

LarPO ¹³ (Jin et al., ICML 2025) charted the bidirectional bridge between IR and LLM alignment. They showed that IR techniques—ranking objectives (ListMLE, LambdaRank), hard negative mining, candidate list construction—map directly onto LLM alignment, achieving 38.9% improvement on AlpacaEval2. The bridge runs both ways: LarPO brought information retrieval into LLM alignment. Kimchi brings LLM alignment back home to information retrieval.

Retrieval-GRPO ⁷ (Chen et al., 2025) and the earlier Taobao SSMDP ⁸ (Hu et al., KDD 2018) prove that RL post-training works in production e-commerce search at scale—Taobao reported a 30% transaction-volume increase over supervised LTR. These deployments use click-derived, session-level, and multi-objective reward signals tied to historical user behaviour.

The tools evolution we described in The Three Waves of Agentic Tools predicted this: Wave 3 dissolves the boundary between tool and agent by letting RL discover the tool's own operating strategies. In the cognitive core thesis, we argued that the model is becoming commodity and the tool is becoming the product. Post-training the retrieval tool—rather than the consuming LLM—is the direct implementation of that thesis. It is the dual of the standard agent-alignment approach: hold the agent fixed, fine-tune the tool. One gradient pass on a dual encoder. Every agent that calls the tool inherits the improvement at zero marginal inference cost.

The reward channel is the design choice on which the approaches examined below diverge.

3. The Reward Channel: RLAIF Under a Frozen Rubric Judge

Placing this construction within the RL-from-feedback taxonomy matters, because the taxonomy determines the audit surface and the failure modes.

RLHF ¹ replaces a hand-coded reward with a learned scalar fit to human pairwise preferences. RLAIF ² replaces the human labeller with an AI labeller and is competitive with RLHF when the labeller is sufficiently capable. RLVR ^{14, 15} replaces the learned scalar with a rule-checkable signal—pass/fail unit tests, exact-match equality, format conformance—and is hard to game except by solving the task.

Rubrics-as-rewards ¹⁶ extends RLVR to domains without a closed-form correctness check by encoding the correctness check as a structured rubric scored by a frozen LLM judge. Our construction sits firmly in the rubrics-as-rewards / RLAIF intersection: the AI labeller is frozen (not co-trained with the policy), and the rubric is external (versioned, calibrated against human labels, and treated as a measurement instrument rather than a truth oracle).

It is not RLVR. Commercial product relevance is not binary pass/fail. It is graded, query-conditional, taxonomy-sensitive, and often visible only in product imagery ¹⁷. A query for "wedding guest dress" and a query for "summer dress" may overlap in the catalogue, but the relevance judgement is fundamentally different. The rubric must encode asymmetric specificity, substitute/complement distinctions, and multimodal evidence weighting—exactly the properties we built into our evaluation judge (A New Age in Evaluation: The LLM Judge).

The reason to use a judge at all is measurement validity. Click-derived labels are not neutral relevance labels: they are a record of what an older system chose to expose, filtered through position bias and historical traffic ^{18–20}. We examined this problem in detail in Product Search Needs Better Evals: public benchmarks are useful lab equipment, but commerce needs real queries, graded labels, images, metadata, and honest caveats about historical bias. The calibrated rubric judge is the measurement instrument that makes post-training possible without click logs.

This reward-signal design also distinguishes our construction from concurrent RL-for-retrieval work. Retrieval-GRPO ⁷ post-trains Taobao's dense retriever via multi-objective GRPO, but its relevance reward is generated by TaoSR1, a co-trained 42B MoE whose internals are opaque to the retriever team—an unstable reward surface that drifts with the MoE's own training schedule. UAE ⁹ distils LLM perplexity reduction into a bi-encoder, binding the retriever's quality ceiling to a specific downstream generator. MAPLE ¹⁰ uses MLLM logit distributions as preference signals, coupling the retriever to the MLLM's modality-alignment quality. In each case the reward channel is either co-evolving (Retrieval-GRPO), task-bound (UAE), or model-bound (MAPLE). Our frozen, versioned, calibrated rubric judge is none of these: it is an instrument that can be audited, re-calibrated, and deployed independently of any downstream consumer.

We built the rubric-graded judge as an evaluation tool (How We Do Evals). The key insight is that the same calibrated judge that drives offline evaluation can serve as the frozen reward channel for policy optimisation. This is a dual use of measurement: what measures relevance also defines relevance for the gradient step.

The judge $J_\phi$ evaluates each $(q, c)$ pair against a structured rubric producing independent textual and visual relevance scores on a $1$ – $4$ ordinal scale, plus short per-axis explanations so disagreements can be audited rather than averaged away. The key rubric rule is asymmetric specificity: writing $q' \sqsupset q$ for " $q'$ is a strict specialisation of $q$ in the catalogue taxonomy",

J_\phi(q', c) \leq J_\phi(q, c) \quad \text{whenever } q' \sqsupset q \text{ and } c \text{ is generic w.r.t. } q'

So a generic running shoe is acceptable evidence for the broad query running shoes (high $J_\phi$ ) but should be penalised under the narrow query trail running shoes (low $J_\phi$ ). The rubric also distinguishes exact matches, substitutes, complements, category errors, visual evidence, and metadata evidence.
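
The asymmetric-specificity rule can be checked mechanically on a calibration set. The sketch below is illustrative only: `specificity_violations`, `toy_judge`, and the lookup-table scores are hypothetical stand-ins, and a real judge would be an LLM call rather than a dictionary.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (query, candidate) -> scalar score on the 1-4 scale.
Judge = Callable[[str, str], float]
Triple = Tuple[str, str, str]  # (broad query q, narrow query q', generic candidate c)


def specificity_violations(judge: Judge, triples: List[Triple]) -> List[Triple]:
    """Return triples where a generic candidate scores higher under the
    narrow query than under the broad one, i.e. J(q', c) > J(q, c)."""
    return [
        (q, q_narrow, c)
        for q, q_narrow, c in triples
        if judge(q_narrow, c) > judge(q, c)
    ]


# Toy stand-in for the judge; the scores are made up for illustration.
_SCORES = {
    ("running shoes", "generic running shoe"): 4.0,
    ("trail running shoes", "generic running shoe"): 2.0,
}


def toy_judge(query: str, candidate: str) -> float:
    return _SCORES[(query, candidate)]


triples = [("running shoes", "trail running shoes", "generic running shoe")]
violations = specificity_violations(toy_judge, triples)  # empty for this toy judge
```

A check of this kind treats the rule as a property of the instrument that can be tested before the judge is used as a reward.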

Treating the calibrated judge as the frozen reward channel, the per-list reward is:

R(q, \sigma) = \sum_{i=1}^k w_i \cdot J_\phi(q, \sigma_i)

with positional weights $w_i \geq 0$ . Setting $w_i = 1 / \log_2(i + 1)$ recovers a discounted-cumulative-gain shape ²¹ and gives the policy gradient an NDCG-flavoured per-position credit signal; setting $w_i = 1$ recovers a flat sum closer to the "verifiable correctness" framing of RLVR. We expose this as a hyperparameter.
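
As a minimal sketch of this reward, assuming the judge scores for a sampled list are already available as plain floats (the function names and the example scores are illustrative, not Kimchi's API):

```python
import math
from typing import List, Sequence


def dcg_weights(k: int) -> List[float]:
    """w_i = 1 / log2(i + 1) for ranks i = 1..k."""
    return [1.0 / math.log2(i + 1) for i in range(1, k + 1)]


def flat_weights(k: int) -> List[float]:
    """w_i = 1 for every rank."""
    return [1.0] * k


def list_reward(judge_scores: Sequence[float], weights: Sequence[float]) -> float:
    """R(q, sigma) = sum_i w_i * J(q, sigma_i), with judge_scores[i-1] = J(q, sigma_i)."""
    if len(judge_scores) != len(weights):
        raise ValueError("need one weight per ranked item")
    return sum(w * j for w, j in zip(weights, judge_scores))


# Illustrative judge scores for a top-3 list, best item first.
scores = [4.0, 2.0, 3.0]
r_dcg = list_reward(scores, dcg_weights(3))    # position-discounted reward
r_flat = list_reward(scores, flat_weights(3))  # order-insensitive sum
```

Under the discounted weights, moving a high-scoring item towards the top raises the reward; under flat weights the reward depends only on which items are in the list, not on their order.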

Crucially, $R(q, \sigma)$ is linear in the per-item judgments: it can be written as a sum of per-rank contributions $R_i(q, \sigma) = w_i J_\phi(q, \sigma_i)$ , each depending only on $\sigma_i$ and not on the rest of $\sigma$ . This is a substantive modelling choice. It rules out set-level objectives such as NDCG normalised by IDCG, diversity, subtopic coverage, or redundancy penalties—those require non-additive rewards. In exchange, linearity is what enables the per-rank variance-reduction factorisation.
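
One standard consequence of an additive reward is that the policy-gradient term for slot $i$ only needs the reward accumulated from rank $i$ onwards: the slot- $i$ score function has zero mean given the prefix, so contributions from earlier ranks drop out. The sketch below computes the per-rank contributions and these "reward-to-go" suffix sums. It is an illustration of what linearity permits, and it is an assumption that this matches the per-rank factorisation referred to here; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def per_rank_contributions(
    judge_scores: Sequence[float], weights: Sequence[float]
) -> List[float]:
    """R_i = w_i * J(q, sigma_i) for each rank i."""
    return [w * j for w, j in zip(weights, judge_scores)]


def reward_to_go(contributions: Sequence[float]) -> List[float]:
    """Suffix sums: entry i holds R_i + R_{i+1} + ... + R_k."""
    suffix: List[float] = []
    running = 0.0
    for r in reversed(contributions):
        running += r
        suffix.append(running)
    return suffix[::-1]


# Illustrative judge scores and DCG-shaped weights for a top-3 list.
judge_scores = [4.0, 2.0, 3.0]
weights = [1.0 / math.log2(i + 1) for i in range(1, 4)]
contribs = per_rank_contributions(judge_scores, weights)
to_go = reward_to_go(contribs)  # to_go[0] is the full list reward
```

A non-additive reward such as a redundancy penalty has no such decomposition, because the credit for slot $i$ would depend on items placed at other ranks.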

4. Retrieval as a Stochastic Ranking Policy

To apply reinforcement learning to retrieval, the scoring model must become a stochastic policy rather than a deterministic sorter.

Let $\mathcal{C}_q \subseteq \mathcal{C}$ denote the candidate pool exposed to the ranker for query $q$ , and $s_\theta : (q, c) \to \mathbb{R}$ a parameterised scoring function—typically a dual encoder or a multimodal scorer. The scorer induces a categorical distribution over items via a softmax with temperature $\tau$ :

\pi_\theta(c \mid q) = \frac{\exp(s_\theta(q, c) / \tau)}{\sum_{c' \in \mathcal{C}_q} \exp(s_\theta(q, c') / \tau)}
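
A minimal sketch of this softmax, assuming the scorer's outputs for the candidate pool arrive as a plain list of floats (the name `softmax_policy` is illustrative):

```python
import math
from typing import List, Sequence


def softmax_policy(scores: Sequence[float], tau: float = 1.0) -> List[float]:
    """pi(c | q) proportional to exp(s(q, c) / tau) over the candidate pool.

    Subtracting the maximum before exponentiating leaves the result
    unchanged and avoids overflow for large scores or small tau.
    """
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Lower temperature concentrates mass on the top-scored item;
# higher temperature flattens the distribution.
sharp = softmax_policy([2.0, 1.0, 0.0], tau=0.1)
soft = softmax_policy([2.0, 1.0, 0.0], tau=10.0)
```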

A top- $k$ ranking $\sigma = (\sigma_1, \dots, \sigma_k)$ is a length- $k$ sequence of distinct items from $\mathcal{C}_q$ . The natural distribution on rankings induced by $s_\theta$ is the Plackett–Luce (PL) distribution ^{22, 23}, obtained by sampling without replacement proportionally to the softmax weights:

P_\theta(\sigma \mid q) = \prod_{i=1}^k \frac{\exp(s_\theta(q, \sigma_i) / \tau)}{\sum_{c \in \mathcal{C}_q \setminus \sigma_{< i}} \exp(s_\theta(q, c) / \tau)}

where $\sigma_{< i} = \{\sigma_1, \dots, \sigma_{i-1}\}$ is the set of items already drawn. The log-density admits a tractable factorisation over ranks—the per-slot conditional is just a softmax over the remaining items—which is what makes the policy gradient computable. PL models have a long history in learning-to-rank: ListNet and ListMLE ^{4, 5} use them for supervised ranking, TF-Ranking ²⁴, MDPRank ²⁵, ApproxNDCG ²⁶, and PL-Rank ¹¹ extend the model to gradient-based or RL settings, and Neural PG-RANK ⁶ instantiates a full REINFORCE + LOO baseline over PL rankings. Our construction shares this scaffold but replaces the downstream-task reward with a multimodal frozen rubric-graded judge—decoupling the reward channel from any specific generator or click stream—and adds the per-rank Rao–Blackwell variance reduction techniques that list-level estimators do not exploit.
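
The pieces above can be sketched in a few lines: sequential sampling without replacement, the rank-factorised log-probability, and a leave-one-out baseline over several sampled lists. This is a generic illustration using only the standard library, not Kimchi's implementation; the function names are hypothetical, and a real training loop would compute the log-probability inside an autodiff framework.

```python
import math
import random
from typing import List, Sequence


def pl_sample(scores: Sequence[float], k: int, tau: float = 1.0, rng=random) -> List[int]:
    """Draw a top-k ranking (candidate indices) from the Plackett-Luce model
    by repeatedly sampling from a softmax over the items not yet drawn."""
    if k > len(scores):
        raise ValueError("k cannot exceed the pool size")
    remaining = list(range(len(scores)))
    ranking: List[int] = []
    for _ in range(k):
        m = max(scores[j] / tau for j in remaining)
        weights = [math.exp(scores[j] / tau - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def pl_log_prob(scores: Sequence[float], ranking: Sequence[int], tau: float = 1.0) -> float:
    """log P(sigma | q): sum over ranks of the log-softmax over remaining items."""
    remaining = set(range(len(scores)))
    logp = 0.0
    for idx in ranking:
        m = max(scores[j] / tau for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] / tau - m) for j in remaining))
        logp += scores[idx] / tau - log_z
        remaining.remove(idx)
    return logp


def loo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out baseline: each sample's reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Usage: sample a top-3 list from a pool of four candidates and score it.
rng = random.Random(0)
pool_scores = [2.0, 1.0, 0.0, -1.0]
sigma = pl_sample(pool_scores, k=3, tau=0.5, rng=rng)
logp = pl_log_prob(pool_scores, sigma, tau=0.5)
```

In a REINFORCE-style update, each sampled list's log-probability gradient would be weighted by its leave-one-out advantage, with the list reward supplied by the judge.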

6. Product Search Enters the Post-Training Era

For years, improving search meant either inflating the pretraining dataset or bolting complex heuristic rerankers onto a weak dual-encoder base. We treated the ranking problem as an approximation issue, relying on contrastive proxies to guess what a good ranked list might look like. That is the two-stage pipeline: pretrain, fine-tune, deploy.

That era is ending. Just as language models required post-training to bridge the gap between predicting the next token and fulfilling user intent, retrieval models need post-training to bridge the gap between measuring similarity and constructing a coherent, high-utility top- $k$ list for a specific query distribution, a specific relevance definition, and a specific deployment context.

With Kimchi 1, we have formalised a specific instantiation of the third stage of the retrieval pipeline—one designed around an auditable, frozen rubric judge and the theoretical properties (per-rank variance reduction, temperature-regime bounds) that make it tractable at catalogue scale without historical interaction data—and deployed it in production. The evaluation rubric we built to measure relevance (blog 25, blog 26) became the reward signal we use to optimise relevance. The tools-as-models thesis (blog 21) and the cognitive-core argument (blog 22) predicted that RL would reach the tool layer. It has—and because the reward channel is decoupled from any single consumer, the same post-trained retriever unlocks multimodal applications (image-grounded relevance, visual-similarity ranking) and co-optimised tool chains where multiple RL-tuned components compose without retraining each other.

Post-trained search is one component in Solenya's zero-shot discovery training recipe, alongside contributions in recommendations, multimodality, evals and personalization. We hope to share more of these innovations from our team in the future, to help bring the best e-commerce experiences everywhere and to everyone.

Welcome to the post-training era.

1. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, et al. Training Language Models to Follow Instructions with Human Feedback. In: Advances in Neural Information Processing Systems (NeurIPS). 2022.

2. Lee H, Phatale S, Mansoor H, Mesnard T, Ferret J, Lu K, et al. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. 2023.

3. Oord A van den, Li Y, Vinyals O. Representation Learning with Contrastive Predictive Coding. 2018.

4. Cao Z, Qin T, Liu T-Y, Tsai M-F, Li H. Learning to Rank: From Pairwise Approach to Listwise Approach. In: Proceedings of the 24th International Conference on Machine Learning. 2007. p. 129–136.

5. Xia F, Liu T-Y, Wang J, Zhang W, Li H. Listwise Approach to Learning to Rank: Theory and Algorithm. In: Proceedings of the 25th International Conference on Machine Learning. 2008. p. 1192–1199.

6. Gao G, Setty S, Anand A, Hasibi F. Policy-Gradient Training of Language Models for Ranking. arXiv preprint arXiv:2310.04407. 2023.

7. Chen H, et al. TaoSearchEmb: A Multi-Objective Reinforcement Learning Framework for Dense Retrieval in Taobao Search. arXiv preprint arXiv:2511.13885. 2025.

8. Hu Y, Da Q, Zeng A, Yu Y, Xu Y. Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 2018.

9. Jiang Z, et al. Aligning Dense Retrievers with LLM Utility via Distillation. arXiv preprint arXiv:2604.22722. 2026.

10. Wang X, et al. Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment. In: Advances in Neural Information Processing Systems (NeurIPS). 2025.

11. Oosterhuis H. Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness. In: Proceedings of the 44th International ACM SIGIR Conference. 2021.

12. Zhang Y, Qin Z, Wang Z, Wu W, Deng S. Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG. In: Proceedings of the International Conference on Machine Learning (ICML). 2026.

13. Jin B, Yoon J, Qin Z, Wang Z, Xiong W, Meng Y, et al. LLM Alignment as Retriever Optimization: An Information Retrieval Perspective. In: Proceedings of the International Conference on Machine Learning (ICML). 2025.

14. Lambert N, et al. Reinforcement Learning with Verifiable Rewards: GRPO’s Effective Loss, Dynamics, and Success Amplification. 2025.

15. Guo D, Yang D, Zhang H, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.

16. Viswanathan V, et al. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. 2025.

17. Reddy CK, Màrquez L, Valero F, Rao N, Zaragoza H, Bandyopadhyay S, et al. Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. 2022.

18. Joachims T, Swaminathan A, Schnabel T. Unbiased Learning-to-Rank with Biased Feedback. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 2017. p. 781–789.

19. Ai Q, Bi K, Luo C, Guo J, Croft WB. Unbiased Learning to Rank with Unbiased Propensity Estimation. In: Proceedings of the 41st International ACM SIGIR Conference. 2018. p. 385–394.

20. Wang X, Golbandi N, Bendersky M, Metzler D, Najork M. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In: Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM). 2018. p. 610–618.

21. Järvelin K, Kekäläinen J. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems. 2002;20(4):422–446.

22. Plackett RL. The Analysis of Permutations. Journal of the Royal Statistical Society Series C (Applied Statistics). 1975;24(2):193–202.

23. Luce RD. Individual Choice Behavior: A Theoretical Analysis. Wiley; 1959.

24. Pasumarthi RK, Bruch S, Wang X, Li C, Bendersky M, Najork M, et al. TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank. In: Proceedings of the 25th ACM SIGKDD International Conference. 2019. p. 2970–2978.

25. Wei Z, Xu J, Lan Y, Guo J, Cheng X. Reinforcement Learning to Rank with Markov Decision Process. In: Proceedings of the 40th International ACM SIGIR Conference. 2017. p. 945–948.

26. Bruch S, Zoghi M, Bendersky M, Najork M. Revisiting Approximate Metric Optimization in the Age of Deep Neural Networks. In: Proceedings of the 42nd International ACM SIGIR Conference. 2019. p. 1241–1244.