The shift from single-turn chatbots to multi-step agents has quietly broken traditional evaluation. When an agent uses the Model Context Protocol to call tools, the path to an answer matters as much as the answer itself. The path includes which tools the agent picks, how it forms arguments and how it recovers from errors. This is all missed by end-state grading.

This is not a new observation. AgentBench ¹ showed that models excelling on static benchmarks often fail in tool-using environments, with poor long-term reasoning and weak multi-step instruction following as the primary bottlenecks. What was missing was a principled way to evaluate agents as they actually operate: multi-turn, tool-mediated, and route-dependent.

From Answer Grading to Trajectory Evaluation

Early agentic evals tried exact-match JSON parsing. Failing an agent because it formatted a payload as {"price": 199.99} instead of {"price": 200}, when both were accepted upstream, proved too brittle. The field has since moved toward deep trajectory evaluation: scoring the full sequence of tool calls and their outcomes, not just the final response.

MCPEval ² formalised this split between syntax (tool call matching) and semantics (LLM judge analysis). TRAJECT-Bench ³ introduces metrics for tool dependency chains and explicitly states that evaluating the final answer alone is insufficient.

The practical obstacle is raw MCP transport logs: highly verbose, full of protocol metadata, difficult for an LLM judge to parse reliably. The solution is a two-stage approach. A deterministic normalization script intercepts the logs first, producing a clean Evidence Trace: just the sequence of tool calls, arguments, success/failure status, and result previews. The judge then evaluates only this normalized trace, keeping mechanical execution separate from semantic scoring. This draws from profile-aware LLM-as-a-judge work ⁴, where structuring the context presented to a judge, rather than feeding it raw noisy input, improves evaluation reliability.

How We Ran the Sweep

We built a four-stage pipeline orchestrated with DVC: persona generation, conversation simulation, trace normalization, and trajectory judging.

Grounded synthetic personas. Each run generates 30 synthetic shoppers per product catalogue. Before profile creation, the pipeline issues broad search_catalog calls against the live MCP server and injects extracted categories, brands, and price ranges into the prompt. Persona psychographics are structured around VALS, OCEAN, JTBD, and benefit-segmentation frameworks, with conversation depth sampled from 2–10 turns. Agents cannot succeed by parroting a product name buried in the profile.

Two-agent simulation. A persona agent plays the shopper. The MCP agent under test connects to a specific route, receives server instructions via the standard initialize handshake, and decides when and how to call tools. Failed tool calls are retried with backoff; a conversation is marked failed only after repeated consecutive failures, so transient errors do not silently pass as success.

Experimental sweep. The core comparison is a paired variant sweep: each cell combines one MCP endpoint, one matching agent prompt, one simulation model, and one catalogue index. We evaluate three retail catalogues across four simulation models (Ministral-3B, Qwen3-8B/14B/32B) and eight MCP routes. The routes are deliberately additive: the search route exposes only search_catalog, filtered adds constraint-based filtering on top of that, and full with facets layers in facet discovery and the remaining tools. Utility routes like weather and calculator test general tool-calling behaviour outside the product domain. The GraphQL route sits apart from this ladder entirely. That yields hundreds of cells, each with 30 conversations.

The GraphQL route is the deliberate stress case, inspired by the Apollo GraphQL pattern of exposing a single flexible query interface. It exposes a single graphql_query tool against the same underlying catalogue that other routes serve through curated tools like search_catalog and get_catalog_facets. Same data plane, different interface contract.

Trace normalization. A deterministic script distills each conversation into an Evidence Trace: a chronological sequence of tool actions, each recording the tool name, parsed arguments, success/failure, error messages, result counts, brand/type signals, and a compact product preview. Budget, category, and brand constraints are snapshotted at each step so the judge can assess whether retrieval matched intent.

LLM-as-judge scoring. An LLM judge (Nemotron 3 Super via OpenRouter) scores each normalized trace on three integer dimensions (1–3):

Dimension	What it measures
Need match	Did tool queries and returned previews fit the shopper's stated constraints across turns?
Tool execution	Quality of tool choice, argument formation, and recovery from empty or off-target results.
Trust / grounding	Are assistant claims supported by tool evidence, without fabricated product details?

The rubric treats search_catalog result previews as canonical evidence of what the shopper saw, penalizes redundant in-chat product catalogs, and rewards constraint-aware reformulation over fluent prose. The judge runs at temperature 0. Per-cell scores are bootstrapped into confidence intervals and rolled up into route × model × catalogue tables.

The Results: Interface Contract Dominates

The sweep split cleanly. Curated MCP routes produced usable shopping agents. The GraphQL route did not.

Grouped bar chart showing agent overall scores across eight MCP routes and four model sizes. The GraphQL (apollo) route clusters near 1.15 across all models while curated routes reach 1.8–2.1. — Overall score by route and model size (95% BCa bootstrap CI). Apollo sits at the rubric floor regardless of model. Curated routes reach ~2.0 at 8B.

The GraphQL route failed across the board. Qwen3 models averaged roughly 1.14–1.19 overall on the GraphQL route versus ~2.0 on search and full with facets routes. Agents did not learn the schema; they guessed it, inventing field names, nesting, and filter types until queries failed on the first tool turn. Need match and tool execution sat at the rubric floor across 8B, 14B, and 32B. In the worst cells every conversation failed; even the best cells lost roughly half. The fundamental issue is that open-ended GraphQL authoring against an underspecified interface is a schema-memorization task, not a reasoning task. Smaller models have no basis for memorizing a schema they have never seen.

Bigger models did not fix this. Qwen3-32B on the GraphQL route was no better than 8B. The bottleneck is the interface contract, not context window size. No amount of additional capacity rescues an agent that must hallucinate a schema on every turn.

Curated tools changed the picture. search_catalog and related operations give agents a small, typed contract instead of a schema to hallucinate. Qwen3-8B on search-only MCP reached ~2.0 with zero conversation failures, credible discovery at a fraction of frontier-model cost.

Scale bought little once the interface was right. Search peaked at 8B and declined slightly beyond that; full with facets continued improving to 14B before plateauing. Either way, the main jump is 3B → 8B — scaling from 8B to 32B returns little. Better tool design moved the needle far more than bigger models. Ministral-3B still struggled everywhere, so there is a floor, but that floor is 8B, not 32B.

Line chart showing how overall score changes from 3B to 8B to 14B to 32B parameters for three routes. GraphQL (apollo) is flat near 1.15 across all sizes. Search peaks at 8B and declines slightly. Full with facets peaks at 14B and plateaus. — Effect of model scale on agent performance. GraphQL stays at the rubric floor regardless of model size. On curated routes the meaningful jump is 3B → 8B; scaling beyond that returns little.

The practical lesson is a modular ladder. Search-only MCP is a strong, cheap baseline. Adding filters or the full with facets toolset improves specific capabilities like constraint-aware filtering and richer discovery, rather than rescuing broken agents. You compose the surface to fit the product requirements; you do not ask the model to write GraphQL from imagination.

Bottom line: raw GraphQL in MCP is a stress test, not a production interface. Curated MCP tools are what make mid-size models work. Past 8B, invest in tool design before model scale.

Liu X, Yu H, Zhang H, Xu Y, Lei X, Lai H, et al. Agentbench: Evaluating llms as agents. In: International Conference on Learning Representations. 2024. p. 52989–53046.

Liu Z, Qiu J, Wang S, Zhang J, Liu Z, Ram R, et al. Mcpeval: Automatic mcp-based deep evaluation for ai agent models. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2025. p. 373–402.

He P, Dai Z, He B, Liu H, Tang X, Lu H, et al. TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use. arXiv preprint arXiv:251004550. 2025;

Fabbri F, Penha G, D’Amico E, Wang A, De Nadai M, Doremus J, et al. Evaluating podcast recommendations with profile-aware llm-as-a-judge. In: Proceedings of the Nineteenth ACM Conference on Recommender Systems. 2025. p. 1181–1186.