From Prompt Engineering to Harness Engineering: Why E-Commerce Agents Keep Failing
The AI industry has moved through three distinct engineering paradigms in just a few years. The first was prompt engineering: carefully crafting instructions to coax useful outputs from a language model. The second was context engineering: ensuring the model has the right information at the right time via retrieval, memory, and structured context windows. The third, and the one that actually ships production systems, is harness engineering. This is the discipline of designing the entire orchestration layer that wraps around a reasoning core: tools, guardrails, evaluation pipelines, grounding mechanisms, fallback logic, and UI orchestration.
In developer tooling, this progression is well understood. Cursor and GitHub Copilot are not better because they have a better model. They are better because they have a better harness, the scaffolding that transforms a raw LLM into a reliable, grounded production agent. The model reasons. The harness acts.
E-commerce has not caught up. The industry is still in the prompt engineering era, wrapping an LLM in a chat bubble and hoping for the best. The result is predictable: an ocean of chatbots that hallucinate products, repeat items, lose context, and feel like FAQ bots wearing a conversational skin. Meanwhile, the few retailers who have attempted harness engineering in-house have discovered just how fragile the result is without deep retrieval integration and systematic evaluation. This is what happens when you skip the harness.
The Market Has a Harness Problem
To understand the gap, we ran a structured competitive analysis. We tested five live e-commerce chatbots against real user scenarios: discovery queries, occasion-based shopping, budget constraints, and edge cases. The goal was not to rank chat UIs. It was to identify where harness engineering was present and where it was absent.
The evaluation framework focused on four questions for each system:
- What's working? Features worth stealing.
- What's broken? Trust-destroying failures.
- What's missing? Gaps we can exploit.
- What does this reveal about the harness? Is there orchestration behind the chat, or just an LLM with a product feed?
What we found confirmed the thesis: the best experiences had fragments of harness thinking, but none had committed to it fully.
What We Observed
The strongest performer maintained context across multi-turn conversations, offered follow-up prompts, and behaved like an expert shop assistant capable of comparison and research. But even here, the UX was disconnected from the intelligence. The harness was partially built but not fully integrated into the product surface.
Mid-tier systems were fast with solid catalogue knowledge and informative responses. Some pushed users toward chat with clever in-listing prompts. But they showed repeat products, lost conversation context, and had no mechanism for clarifying ambiguous intent. The retrieval was decent. The orchestration was absent.
The weakest systems were slow, text-heavy, prone to hallucination, and gave no signal of progress. They had no world knowledge, no understanding of user constraints, and no recovery path when they failed. These are what you get when the harness is just "pass the query to the LLM and render whatever comes back."
The pattern was consistent: every failure we observed was a harness failure, not a model failure. The models could reason. They just had nothing to reason with and no system to keep them grounded.
What made this especially revealing is that the strongest and weakest systems likely used comparable foundation models. The gap between a contextual expert assistant and a slow, hallucinating FAQ bot was not intelligence. It was infrastructure. The systems that felt good had fragments of orchestration: managed context, retrieval grounding, follow-up logic. The systems that felt broken had none of this, just a raw model connected to a product feed with no intermediary guaranteeing quality.
What an Agent Harness Must Enforce
Based on our analysis, a well-engineered e-commerce agent harness must enforce the following constraints, not as nice-to-haves, but as system-level requirements:
- Grounded retrieval: Every product shown must exist in the catalogue. No hallucination. No stale inventory. The harness must guarantee this at the tool layer, not hope the model gets it right.
- Speed: Sub-second responses for basic queries. The harness must manage latency budgets across retrieval, reasoning, and rendering. A loading spinner is a harness failure.
- Brevity: Agents should not be wordy. The harness must constrain output length and structure responses for action, not for reading.
- Context retention: Multi-turn state must be managed by the harness, not left to the model's context window. Session memory, constraint tracking, and preference accumulation are infrastructure problems.
- Graceful recovery: When intent is ambiguous or results are poor, the harness must trigger clarifying prompts, fallback strategies, or explicit acknowledgment, not silence or repetition.
These are not chatbot features. They are engineering constraints that the harness must enforce regardless of which model sits behind it. Swap the model tomorrow, and the harness guarantees the experience stays consistent.
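To make the first and last constraints concrete, here is a minimal Python sketch of a tool-layer grounding check with a recovery path. It is an illustration under stated assumptions, not Solenya's implementation: the `Product` type, the in-memory `CATALOGUE`, and the `ground` and `respond` functions are all hypothetical names, and a real harness would query a live index rather than a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    sku: str
    title: str
    in_stock: bool


# Hypothetical catalogue snapshot; a real harness would query a live index.
CATALOGUE = {
    "sku-1": Product("sku-1", "Waxed rain jacket", True),
    "sku-2": Product("sku-2", "Linen shirt", False),
}


def ground(proposed_skus: list[str]) -> list[Product]:
    """Keep only SKUs that exist and are in stock, dropping repeats."""
    seen: set[str] = set()
    grounded: list[Product] = []
    for sku in proposed_skus:
        product = CATALOGUE.get(sku)
        if product is None or not product.in_stock or sku in seen:
            continue
        seen.add(sku)
        grounded.append(product)
    return grounded


def respond(proposed_skus: list[str]) -> dict:
    """Return grounded products, or an explicit clarifying prompt if none survive."""
    products = ground(proposed_skus)
    if not products:
        return {
            "type": "clarify",
            "message": "I could not find a match. Could you tell me more about what you need?",
        }
    return {"type": "products", "items": [p.title for p in products]}
```

Whatever the model proposes, only catalogue-verified items reach the user, and an empty result triggers a clarifying question instead of silence. The check lives outside the model, so it behaves the same whichever model sits behind it.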
The fragility of getting this wrong cannot be overstated. Without proper grounding at the tool layer, the harness silently degrades: products that no longer exist surface in recommendations, constraints get ignored, and the agent confabulates details that erode trust with every interaction. A hand-rolled harness that works on demo day but lacks systematic evaluation will drift into unreliability within weeks. LLMs are stochastic; the harness must be deterministic where it counts.
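One way to keep constraints from being ignored across turns is to hold them in harness-managed state rather than in the prompt. The sketch below assumes a hypothetical `SessionState` class; the field names and merge policy (later turns override earlier ones) are illustrative choices, not a description of any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    # Constraints accumulated over the conversation, e.g. {"max_price": 80}.
    constraints: dict[str, object] = field(default_factory=dict)
    # SKUs already shown, so the agent does not repeat products.
    shown_skus: set[str] = field(default_factory=set)

    def update(self, turn_constraints: dict[str, object]) -> None:
        """Merge constraints from the latest turn; later turns override earlier ones."""
        self.constraints.update(turn_constraints)

    def unseen(self, skus: list[str]) -> list[str]:
        """Filter out products already shown in this session, then record the rest."""
        fresh = [s for s in skus if s not in self.shown_skus]
        self.shown_skus.update(fresh)
        return fresh


state = SessionState()
state.update({"max_price": 80})
state.update({"colour": "navy"})
first_page = state.unseen(["a", "b"])
second_page = state.unseen(["b", "c"])  # "b" was already shown
```

Because the state is plain data, it can be persisted, inspected, and tested independently of the model's context window.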
What Separates Good from Great: Catalogue and World Knowledge
Enforcing constraints gets you a competent agent. What separates a competent agent from a genuinely useful one is access to knowledge that lives outside the API.
Catalogue knowledge means the agent understands the product graph. Not just titles and prices, but relationships between items, sizing systems, brand positioning, seasonal relevance, and complementary products. When a user asks for "something similar but warmer," the agent must navigate product attributes that may not be explicit in any single field.
World knowledge means the agent understands context that no product feed contains. It knows that Tokyo in April is mild but rainy, that a marathon runner prioritises moisture-wicking fabrics, that a wedding in Cape Town in December means summer not winter. This is the knowledge that transforms a search result into a recommendation.
These capabilities cannot be bolted on after the fact. They are architectural decisions about what the harness retrieves, how it structures context, and what tools the reasoning core can reach for. An LLM with access to a flat product feed will always produce flat answers. An LLM with access to a rich, structured world model produces answers that feel like genuine expertise.
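As a rough illustration of the "something similar but warmer" request, the sketch below assumes each item carries structured attributes the harness can navigate. The `Item` fields and the 1 to 5 `warmth` rating are hypothetical; in practice such attributes would have to be derived from materials, weights, and other catalogue data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    sku: str
    category: str
    style: str
    warmth: int  # hypothetical 1-5 rating derived from material and weight


ITEMS = [
    Item("j1", "jacket", "minimal", 2),
    Item("j2", "jacket", "minimal", 4),
    Item("j3", "jacket", "sporty", 5),
    Item("k1", "knitwear", "minimal", 4),
]


def similar_but_warmer(reference: Item, items: list[Item]) -> list[Item]:
    """Same category and style as the reference, strictly warmer, least warm first."""
    matches = [
        i for i in items
        if i.category == reference.category
        and i.style == reference.style
        and i.warmth > reference.warmth
    ]
    return sorted(matches, key=lambda i: i.warmth)


suggestions = similar_but_warmer(ITEMS[0], ITEMS)
```

With a flat feed of titles and prices, this query has nothing to operate on. With structured attributes, it becomes a simple filter.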
Solenya's Philosophy: Engineering the Harness
Solenya MCP is not a chatbot. It is a harness built on four principles:
Multimodal by default. The harness operates across text, image, and structured data. Product discovery is inherently visual, and a user who says "something like this" while looking at an outfit is making a multimodal query. The harness handles this natively. It is not an afterthought or a feature flag.
Co-optimised with retrieval. The reasoning model and the retrieval model are not independent systems duct-taped together. They are co-optimised: the retrieval model understands what the reasoning model needs, and the reasoning model understands what retrieval can provide. This is what eliminates hallucination at the architectural level rather than relying on post-hoc filtering or prompt-level instructions to "only show real products."
World-aware. The harness has access to structured world knowledge: weather, events, cultural context, occasion norms. When a user asks for weather-appropriate clothing in Osaka, the system does not guess. It queries real climate data and adjusts retrieval accordingly.
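A minimal sketch of how a forecast might adjust retrieval, assuming the harness has already fetched climate data from some external source. The `Forecast` type, the filter keys, and the temperature and rain thresholds are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    avg_temp_c: float
    rain_probability: float  # 0.0 to 1.0


def retrieval_filters(forecast: Forecast) -> dict[str, object]:
    """Translate a forecast into catalogue filters (thresholds are illustrative)."""
    filters: dict[str, object] = {}
    if forecast.avg_temp_c < 10:
        filters["warmth"] = "high"
    elif forecast.avg_temp_c < 20:
        filters["warmth"] = "medium"
    else:
        filters["warmth"] = "low"
    if forecast.rain_probability >= 0.5:
        filters["water_resistant"] = True
    return filters


# Placeholder numbers for illustration only, not real climate data.
mild_and_wet = retrieval_filters(Forecast(avg_temp_c=15.0, rain_probability=0.6))
```

The model never has to guess the weather: it receives filters derived from data, and retrieval is narrowed before any product is shown.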
Everywhere and for everyone. The harness exposes capabilities via MCP, which means any agent (Claude, GPT, a custom orchestrator, a customer's own toolchain) can access Zero Shot Discovery in seconds, not weeks of integration work. The same harness that powers human-facing chat powers machine-to-machine agentic commerce. The AI experience is integrated into the product surface, not hidden in a corner as a bolt-on experiment.
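For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 message with the method `tools/call`. The sketch below builds such a request using only the standard library. The tool name `search_products` and its arguments are hypothetical and are not taken from Solenya's actual tool schema.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialise an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# "search_products" and its arguments are hypothetical, for illustration only.
message = tools_call_request(
    1, "search_products", {"query": "rain jacket", "max_price": 120}
)
```

Any MCP-capable client can produce a message of this shape, which is why a single endpoint can serve many different agents.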
Solenya Chat in Action
The first scenario shows a straightforward outfit request. On the left, a competitor's chatbot stalls mid-query. On the right, Solenya's harness retrieves grounded catalogue results that match the aesthetic and fit constraints. No hallucinated products, no irrelevant padding.
The second scenario shows world knowledge in action. A user asks for weather-appropriate t-shirts for Tokyo next week. On the left, the competitor returns irrelevant sunscreen products, and when pressed on whether it actually knows the weather, it refers the user to an external weather site. On the right, Solenya's harness queries real climate data for Tokyo, determines conditions for the coming week, and retrieves seasonally appropriate clothing in a single grounded response.
The third scenario demonstrates constraint adherence. A user with a specific budget and weather conditions expects the agent to combine both constraints and return relevant products. On the left, the competitor provides generic weather advice (common sense, not tool-grounded), then fails entirely when asked for actual products, returning a "please try and rephrase" message. On the right, Solenya's harness queries real weather data, applies the budget as a hard constraint at the tool layer, and returns grounded results in a single turn.
For a human user, these feel like a capable personal shopper. For an agent, they feel like a well-structured API with real guarantees. An autonomous purchasing agent given all three constraints (aesthetic preference, weather conditions, and budget) can traverse the catalogue with confidence because the harness enforces every constraint at the system level.
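Applying the budget "as a hard constraint at the tool layer" can be as simple as filtering candidates before they reach the model or the user. The following sketch assumes a hypothetical `apply_budget` helper and made-up candidate data. It uses `Decimal` to avoid floating-point surprises when comparing prices.

```python
from decimal import Decimal


def apply_budget(products: list[dict], max_price: Decimal) -> list[dict]:
    """Drop anything over budget before results reach the model or the user."""
    return [p for p in products if p["price"] <= max_price]


# Made-up candidates for illustration only.
candidates = [
    {"sku": "t1", "price": Decimal("19.99")},
    {"sku": "t2", "price": Decimal("45.00")},
    {"sku": "t3", "price": Decimal("30.00")},
]
within_budget = apply_budget(candidates, Decimal("30.00"))
```

Because the filter is deterministic code rather than a prompt instruction, an over-budget product cannot appear in the response, however the model phrases its answer.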
Evaluating the Harness: An MCP Evaluation Framework
Building the harness revealed a second-order problem: how do you measure whether your harness is actually working? Adding tools and features without a concrete evaluation method is engineering in the dark.
We developed a standardised MCP evaluation framework that measures harness performance across grounding accuracy, token budgets, constraint adherence, context retention, and recovery quality. This framework is how we validate every change to our harness. Not by vibes, but by quantitative signals that map to user experience.
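The framework itself is not detailed here, so the following is only one plausible way two of these signals could be computed. The function names and definitions are assumptions made for illustration.

```python
def grounding_accuracy(shown_skus: list[str], catalogue_skus: set[str]) -> float:
    """Fraction of shown products that exist in the catalogue."""
    if not shown_skus:
        return 1.0  # nothing shown, so nothing was hallucinated
    return sum(s in catalogue_skus for s in shown_skus) / len(shown_skus)


def constraint_adherence(prices: list[float], max_price: float) -> float:
    """Fraction of returned products that respect the budget constraint."""
    if not prices:
        return 1.0  # no results, so no violations
    return sum(p <= max_price for p in prices) / len(prices)


accuracy = grounding_accuracy(["a", "b", "x", "c"], {"a", "b", "c"})
adherence = constraint_adherence([10.0, 25.0], max_price=20.0)
```

Tracked per release, scores like these turn "the agent feels worse" into a regression that can be caught before users see it.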
We will cover this framework in detail in the next post.
The Harness Is the Product
The rest of the market built chatbots. They wrapped a language model in a bubble, connected it to a product feed, and called it AI shopping. They focused on the chat and hid it in a corner of the page. They never thought about the harness.
That is why their agents are slow, wordy, hallucinatory, and brittle. The model is fine. The model was always fine. What is missing is the engineering layer that makes a reasoning core into a reliable, grounded, contextual shopping agent.
Solenya did not just engineer a harness. We engineered the full stack: a co-optimised retrieval model trained alongside the reasoning layer, MCP tools with structured grounding guarantees, a typed GraphQL API with real filtering semantics, multimodal input from day one, and an evaluation framework that catches regressions before users do. Each layer reinforces the others. The retrieval model knows what the reasoning model needs. The tools enforce constraints the model cannot violate. The evaluation pipeline validates the entire chain, not just the chat output.
The result:
- Integration in seconds, not weeks. A single MCP endpoint replaces weeks of bespoke API integration.
- Context that persists. Multi-turn state, user preferences, and constraint accumulation are managed as infrastructure, not left to the model's memory.
- Grounded answers, guaranteed. Co-optimised retrieval eliminates hallucination at the architectural level, not with post-hoc filtering.
- World knowledge, not just catalogue knowledge. The agent understands your context, your climate, your occasion, not just your keywords.
Other e-commerce agents are search bars wearing a conversational skin. Solenya is full-stack discovery, harness-engineered from retrieval training through to the MCP tool layer, built for the agentic era.