MCP Tool Quality — does an LLM-as-reader know when to call your tools?

Shipping an MCP server is the easy part. Writing tool descriptions an LLM can actually reason from at runtime is the hard part — and the one most teams skip. Tool names, parameter schemas, and natural-language descriptions are what the agent reads to decide whether to call your server at all; the gap between the average MCP server and a great one is almost entirely description quality.

By Chris Mühlnickel · 2026-05-16

What is MCP Tool Quality?

MCP Tool Quality is whether the tools exposed by your [Model Context Protocol](/learn/agent-protocols/model-context-protocol) server have names, descriptions, parameter schemas, and error states that an LLM can read at runtime and reliably pick the right tool for the right task — as opposed to copy-pasted [OpenAPI](/learn/glossary#term-openapi) text written for human developers.

Why it matters

Tool descriptions are what the agent actually reads. When an agent loads your MCP server, it doesn't execute your code, doesn't read your docs, doesn't visit your website. It reads the JSON schema and the natural-language description for each tool you expose. That description is the entire signal. If it's vague, the agent skips your tool. If it's wrong, the agent calls it for the wrong reasons. If it's written for a developer reading OpenAPI ("createInvoice: creates an invoice"), the LLM-as-reader has nothing useful to reason from. The protocol layer is solved; the quality layer is where most teams lose.

The pattern most teams ship is "OpenAPI dumped through a converter." The path of least resistance is: take your existing API spec, run it through one of the open-source OpenAPI-to-MCP generators, ship the result. The output technically works — the server starts, agents can call it, the tools execute. What it doesn't do is help the agent decide when to call which tool. The descriptions are the same single-sentence ones that worked fine for human developers scanning docs and now fall flat for an LLM trying to pick a tool out of a registry of fifty. The pattern is so common that it's the default failure mode in our calibration corpus.

The arXiv research is brutal on the current state. A February 2026 study of 10,831 MCP servers found 73% had repeated tool names, the single most common description smell across the ecosystem. The same line of research found that fixing description-level issues lifted LLM tool-selection accuracy by 11.6 percentage points, and full description augmentation across 856 tools improved end-to-end task success by 5.85 points. These are not marginal effects. They mean a sizable fraction of MCP servers in the wild are silently underperforming because the descriptions don't carry their weight.

Tool quality outweighs tool count, by a lot. Eight excellent tools beat fifty mediocre ones. Every tool in your server consumes context budget when the agent loads it, and every additional tool adds a slot the agent has to consider before picking one. Agents pick correctly from sparse tool sets with rich descriptions; dense tool sets with sparse descriptions confuse them. Most SaaS MCP servers ship the second shape because the temptation is to expose everything; the higher-leverage move is to curate.

This is where "ships MCP" diverges from "ships great MCP." Most SaaS vendors are now in the first bucket — they have an MCP server, it's published, agents can in theory reach it. The differentiation in 2026 isn't presence anymore; it's quality. The vendors winning agent traffic ship hand-curated tool descriptions, scope each server to a focused domain, document errors and side effects, and treat the MCP surface as a first-class product rather than a generated artifact. The gap is wide and visible.

Where it's heading

Tool-description linting becomes a CI gate. The arXiv "smell" framing — repeated names, missing when-to-call signals, undocumented errors, parameter-name inconsistency — is already getting tooled. Expect open-source linters that score MCP tool descriptions against the smell taxonomy and block CI merges when scores fall below threshold, the same way ESLint and stylelint became table-stakes for frontend code in the early 2020s.
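What such a gate might look like can be sketched in a few lines. This is a minimal illustration, not an existing linter: the function names, the keyword heuristics, and the 0.8 threshold are all assumptions, and only three of the smells named above are checked.

```python
import re
from collections import Counter

# Hypothetical keyword heuristics; a real linter would need something sturdier.
WHEN_HINT = re.compile(r"\b(use when|use this when|call when|do not use)\b", re.I)
ERROR_HINT = re.compile(r"\b(error|returns \d{3}|fails)\b", re.I)


def lint_tools(tools):
    """Return (tool_name, smell) pairs for a list of {"name", "description"} dicts."""
    findings = []
    name_counts = Counter(tool["name"] for tool in tools)
    for tool in tools:
        name, desc = tool["name"], tool.get("description", "")
        if name_counts[name] > 1:
            findings.append((name, "repeated-name"))
        if not WHEN_HINT.search(desc):
            findings.append((name, "missing-when-to-call"))
        if not ERROR_HINT.search(desc):
            findings.append((name, "undocumented-errors"))
    return findings


def score(tools):
    """Fraction of (tool, check) pairs that pass; 1.0 means no smells found."""
    checks = 3 * len(tools)
    return 1.0 if checks == 0 else 1 - len(lint_tools(tools)) / checks


def ci_exit_code(tools, threshold=0.8):
    """0 when the score meets the (assumed) threshold, 1 otherwise."""
    return 0 if score(tools) >= threshold else 1
```

In a CI job, the return value of `ci_exit_code` would be passed to `sys.exit` so that a low score fails the build.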

Anthropic and OpenAI publish vendor-specific best-practices guides. Anthropic's "Writing effective tools for AI agents" engineering post (2025) is the first canonical document. Expect deeper guides (Claude-specific patterns, OpenAI-specific patterns, model-family-specific advice) as the vendors compete on agent tool-use accuracy benchmarks like the Berkeley Function Calling Leaderboard.

The MCP spec or a community extension adds description metadata. Today, a description is unstructured natural-language text. Tomorrow, expect structured fields: whenToCall, preconditions, sideEffects, errorClasses, exampleInvocation. The natural-language description stays, but the structured fields let agents reason more reliably and let linters check for completeness.
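To make the idea concrete, here is a hedged sketch of how those structured fields could be modeled and checked for completeness. None of this is part of the MCP spec today: the `ToolMetadata` class and its methods are hypothetical, and only the field names come from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Maps Python attribute names to the speculative camelCase field names above.
_CAMEL = {
    "when_to_call": "whenToCall",
    "preconditions": "preconditions",
    "side_effects": "sideEffects",
    "error_classes": "errorClasses",
    "example_invocation": "exampleInvocation",
}


@dataclass
class ToolMetadata:
    """Hypothetical structured companion to a free-text tool description."""

    when_to_call: str
    preconditions: list = field(default_factory=list)
    side_effects: list = field(default_factory=list)
    error_classes: dict = field(default_factory=dict)
    example_invocation: Optional[dict] = None

    def to_dict(self):
        """Serialize using the camelCase field names."""
        return {camel: getattr(self, snake) for snake, camel in _CAMEL.items()}

    def missing_fields(self):
        """Names of fields left empty, which a linter could report."""
        return [camel for snake, camel in _CAMEL.items() if not getattr(self, snake)]
```

A completeness check then reduces to asserting that `missing_fields()` returns an empty list for every tool.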

A public "MCP Quality Index" becomes a benchmark. Similar to the Web Almanac for SEO or the Lighthouse score for performance, expect a community-maintained ranking of public MCP servers by description quality, tool curation, and end-to-end task success. SaaS vendors track their score and use it competitively. Spekto's Content Hub captures this in the framework's Frontier watchers; the broader ecosystem is converging on the same idea.

Tool curation moves from one-time to continuous. Today, teams ship an MCP server, then forget it. The API drifts, the descriptions go stale, the tool set bloats with every new endpoint. Expect tooling that surfaces drift signals — tool-call success rates dropping, agents asking the same disambiguation questions repeatedly — and prompts the team to re-curate. Treating the MCP surface as a living product rather than a static artifact will be the default by 2027.
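One of the drift signals mentioned, a falling tool-call success rate, is simple to compute once call outcomes are logged. The sketch below assumes a log of `(tool_name, succeeded)` pairs; the function names and the 10-point drop threshold are illustrative choices.

```python
from collections import defaultdict


def success_rates(calls):
    """Map each tool name to its success rate, given (tool_name, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # tool name -> [successes, total calls]
    for name, succeeded in calls:
        totals[name][0] += 1 if succeeded else 0
        totals[name][1] += 1
    return {name: ok / total for name, (ok, total) in totals.items()}


def drifting_tools(baseline_calls, recent_calls, max_drop=0.10):
    """Tools whose recent success rate fell more than max_drop below baseline."""
    base = success_rates(baseline_calls)
    recent = success_rates(recent_calls)
    return sorted(
        name
        for name in base
        if name in recent and base[name] - recent[name] > max_drop
    )
```

Tools that appear only in one of the two windows are ignored here; a fuller version would probably flag them separately as added or retired.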

Common mistakes

  • Auto-generating MCP from OpenAPI without rewriting descriptions. The generated scaffold works mechanically — server starts, tools execute — but every description is the OpenAPI summary, which was written for human developers. Agents need when-to-call context, not what-it-does definitions. Rewrite every description, not just the ones you remember.
  • Shipping a 50-tool omnibus server instead of curated focused servers. Tool count is not a virtue. The arXiv research on 856 tools found augmenting existing descriptions beat adding tools every time. Cull aggressively, split by domain (one server for payments, one for customers, one for invoicing), and let agents pick the right server rather than the right tool out of fifty.
  • Method-style tool names instead of agent-readable verbs. customerCreate, invoiceList, subscriptionGet read like internal API methods, not instructions an agent can follow. Use verb-first, snake_case, agent-readable names: create_customer, list_invoices, get_subscription. The 73% repeated-name finding in the smell study suggests most teams aren't even consistent within their own server.
  • No error-state documentation. When the tool returns a 409, what does that mean? When it returns 422, should the agent retry with different parameters or abandon? Undocumented errors leave the agent guessing; the cost is silent failure and wasted retries. State error classes explicitly in every tool description, not just the happy path.
  • No example invocations for non-obvious tools. If a tool takes more than three parameters or has tricky type coercion (date formats, currency codes, ID prefixes), include a concrete example in the description. The agent reads the example and pattern-matches; without one, parameter formatting becomes a guess-and-retry loop that the agent often abandons.
  • Documenting parameters by type without semantic meaning. customer_id: string is type information the schema already carries. The description should add semantic meaning: "customer_id: the unique identifier of the customer; must begin with 'cus_' and be 16 characters long". Schema describes structure; description provides meaning.
  • Treating the MCP server as a generated artifact instead of a product. Teams that ship MCP and forget it accumulate drift: descriptions go stale, tools deprecate without notice, new endpoints don't get tool wrappers. Treat your MCP surface like any other API: changelog, version pin, deprecation notices, regular curation passes.
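The renaming mistake in particular is mechanical enough to script. The following is a rough sketch that moves a trailing verb to the front and converts to snake_case; the verb list and the naive pluralization for list-style tools are assumptions, and a real migration would need human review of every result.

```python
import re

# Hypothetical verb lists; extend them to match your own API's vocabulary.
KNOWN_VERBS = {"create", "list", "get", "update", "delete"}
COLLECTION_VERBS = {"list"}  # verbs whose noun reads better in the plural


def to_agent_name(method_name):
    """Convert a method-style name (customerCreate) to verb-first snake_case."""
    words = [
        w.lower()
        for w in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", method_name)
    ]
    if len(words) > 1 and words[-1] in KNOWN_VERBS:
        verb, nouns = words[-1], words[:-1]
        if verb in COLLECTION_VERBS:
            nouns[-1] += "s"  # naive plural: invoice -> invoices
        words = [verb, *nouns]
    return "_".join(words)
```

Applied to the examples above, `customerCreate`, `invoiceList`, and `subscriptionGet` become `create_customer`, `list_invoices`, and `get_subscription`; names that are already verb-first pass through unchanged.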

Frequently asked

What's the difference between an MCP server and a high-quality MCP server?

An MCP server is any service that speaks the Model Context Protocol — it lists tools, accepts calls, returns results. A high-quality MCP server has tools an LLM can pick correctly without a tutorial: clear when-to-call descriptions, sane parameter names, predictable error states, and a curated tool set rather than every endpoint dumped in. The protocol doesn't enforce quality; the LLM punishes the absence of it by skipping your server.

How many tools should my MCP server expose?

Fewer than you think. The arXiv research on 856 tools across 103 servers found that augmenting descriptions on existing tools beat adding more tools every time. Eight well-described tools usually beat fifty mediocre ones — the agent reads every tool's description at runtime, so noise costs you context budget and selection accuracy. Cull aggressively, especially for tools that overlap or duplicate.

Can I just auto-generate MCP tools from my OpenAPI spec?

You can generate the scaffold. You can't ship the result. OpenAPI descriptions are written for human developers reading docs: terse, schema-first, often single-sentence. An LLM-as-reader needs the opposite: when to call this tool, what success looks like, what the side effects are, what errors mean. Generate the structure from OpenAPI; rewrite every description for the agent audience.

What does a good tool description actually look like?

Lead with when, not what. 'Creates an invoice for a customer' is what; 'Use when a user asks to bill a customer for a specific amount and you have the customer ID, amount, and currency; do not use for subscriptions or refunds' is when. Add the preconditions, the side effects, the expected response shape, and one example invocation if the parameters are non-obvious. The litmus test: would a colleague who knows the language but not your domain know when to call this?
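Put together, a tool definition following this advice might look like the sketch below. The `name`, `description`, and `inputSchema` keys are the fields an MCP server returns for each tool; everything else is illustrative. The parameter names, the `cus_` ID pattern, the minor-unit amount, and the 409 behavior are assumptions borrowed from examples elsewhere in this article.

```python
import json

# Hypothetical example arguments, embedded in the description below.
EXAMPLE_ARGS = {"customer_id": "cus_0123456789ab", "amount": 4200, "currency": "EUR"}

create_invoice = {
    "name": "create_invoice",
    "description": (
        "Use when a user asks to bill a customer for a specific amount and you "
        "have the customer ID, amount, and currency. Do not use for "
        "subscriptions or refunds. "
        "Side effects: modifies billing state and cannot be undone. "
        "Errors: returns 409 if the customer already has an open invoice. "
        "Example arguments: " + json.dumps(EXAMPLE_ARGS)
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "pattern": "^cus_[A-Za-z0-9]{12}$",
                "description": "Unique customer identifier; begins with 'cus_', 16 characters long.",
            },
            "amount": {
                "type": "integer",
                "minimum": 1,
                "description": "Amount in minor currency units (assumed), e.g. 4200 for 42.00.",
            },
            "currency": {
                "type": "string",
                "pattern": "^[A-Z]{3}$",
                "description": "Three-letter uppercase currency code (assumed), e.g. EUR.",
            },
        },
        "required": ["customer_id", "amount", "currency"],
    },
}
```

The description leads with when to call, states when not to, then covers side effects, errors, and an example; the schema descriptions carry the semantic constraints that the types alone cannot.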

Should my MCP tool names be method-style (camelCase) or verb-style?

Verb-style, agent-readable. create_invoice and list_customers read like instructions; invoiceCreate and customerList read like internal API methods. The arXiv smell study flagged repeated and inconsistent naming across 73% of surveyed servers as the top description issue — agents get confused when getCustomer and customers.fetch and retrieve_customer all appear in the same registry. Pick a convention, apply it uniformly.
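Near-duplicate names like these can be surfaced automatically by normalizing each name before comparing. This is a small sketch under stated assumptions: the synonym table is hypothetical, and the singularization is deliberately naive (it just strips a trailing "s").

```python
import re
from collections import defaultdict

# Hypothetical synonym table mapping verbs to one canonical form.
VERB_SYNONYMS = {"fetch": "get", "retrieve": "get", "read": "get"}


def name_key(name):
    """Reduce a tool name to an order-independent set of normalized tokens."""
    tokens = set()
    for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", name):
        word = VERB_SYNONYMS.get(word.lower(), word.lower())
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]  # naive singular: customers -> customer
        tokens.add(word)
    return frozenset(tokens)


def near_duplicates(names):
    """Group tool names that normalize to the same token set."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

With this normalization, `getCustomer`, `customers.fetch`, and `retrieve_customer` land in the same group, which is the signal to pick one convention and rename the rest.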

Do I need to document errors and side effects in the tool description?

Yes — both. Agents retry on failure; an undocumented error means the agent has no way to decide whether to retry, escalate, or abandon. Undocumented side effects mean the agent may call a destructive tool thinking it's a read. State error classes explicitly ('returns 409 if the customer already has an open invoice'), and label side-effectful tools clearly ('this tool modifies billing state and cannot be undone').

How do I tell whether my MCP server's tool quality is good enough?

Three quick checks. First, give a fresh LLM your tool list (no other context) and ask it to pick a tool for a realistic user request — if it picks wrong or asks for clarification on basic preconditions, your descriptions are too thin. Second, look for tool-name collisions or near-duplicates — if you have any, prune. Third, count tools — over twenty tools is a signal you need to split into multiple focused servers rather than one omnibus surface.
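The three checks can be wired into a small harness. In this sketch `choose_tool` is a hypothetical callable that wraps whatever LLM call you use and returns the name of the tool it picked; the harness itself makes no model calls. The default limit of twenty tools mirrors the rule of thumb above, and collisions are detected by exact name match only.

```python
def selection_accuracy(choose_tool, tools, cases):
    """Share of (user_request, expected_tool_name) cases the chooser gets right."""
    correct = sum(
        1 for request, expected in cases if choose_tool(request, tools) == expected
    )
    return correct / len(cases)


def quick_checks(choose_tool, tools, cases, max_tools=20):
    """Run the three checks and return their results as a dict."""
    names = [tool["name"] for tool in tools]
    return {
        "selection_accuracy": selection_accuracy(choose_tool, tools, cases),
        "has_name_collisions": len(set(names)) != len(names),
        "too_many_tools": len(tools) > max_tools,
    }
```

Keeping the model behind a plain callable means the same harness can run against a stub in unit tests and against a real model in a scheduled evaluation.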