Sub-grade spoke

Indexation Coverage — can agents discover every page that should be cited?

Q: Do AI agents actually use my XML sitemap?

Yes. [GPTBot](/learn/glossary#term-gptbot), [ClaudeBot](/learn/glossary#term-claudebot), and [PerplexityBot](/learn/glossary#term-perplexitybot) all read sitemaps when they're referenced from robots.txt, because it's the cheapest way for them to discover the pages you actually want cited. A missing or broken sitemap forces agents to fall back on link-graph crawling, which misses any page that isn't well linked from the rest of your site.

Q: Where should the sitemap live?

`/sitemap.xml` at site root, referenced from `/robots.txt` as a `Sitemap:` directive with an absolute URL. Multiple sitemaps are fine: split by section or by URL volume if you cross the per-file limits of 50,000 URLs or 50 MB uncompressed, and reference a sitemap index from robots.txt instead.
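
To confirm the reference is actually in place, a minimal Python sketch can parse a robots.txt body with the standard library and list its `Sitemap:` lines. The `example.com` URL and the helper name `sitemaps_declared` are illustrative assumptions, not part of any crawler's implementation.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_declared(robots_txt: str) -> list[str]:
    """Return the Sitemap: URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present (Python 3.8+).
    return parser.site_maps() or []


# Hypothetical robots.txt that points at a sitemap index.
EXAMPLE_ROBOTS = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap_index.xml
"""

print(sitemaps_declared(EXAMPLE_ROBOTS))
```

An empty list here is the "sitemap not referenced from robots.txt" case, even if a sitemap file exists at the root.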

Q: What's a canonical URL, and why do agents care?

A canonical URL declares which version of a page is the *primary* one when the same content is reachable at multiple URLs (tracking parameters, mobile variants, www vs. non-www). Agents use canonical tags to deduplicate citations — without them, the same content can fragment across multiple URLs and dilute the citation signal. 65% of mobile pages now declare canonicals, up from 51% in 2022.

Q: What happens if my sitemap lists pages that are also noindex?

The agent reads the sitemap, fetches the page, sees `noindex`, and drops it from consideration. The contradictory signal wastes crawl budget without adding visibility. Either remove the page from the sitemap or remove the noindex — pick one position and ship it.
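
The conflict described above can be caught before a crawler finds it. The following is a rough sketch, assuming you already have the HTML (and optionally the `X-Robots-Tag` header) for each sitemap URL; the function names and the `pages` mapping are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives.update(p.strip().lower() for p in content.split(","))


def is_noindex(html: str, x_robots_tag: str = "") -> bool:
    """True if the meta robots tag or the X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header = {p.strip().lower() for p in x_robots_tag.split(",")}
    return "noindex" in parser.directives or "noindex" in header


def sitemap_conflicts(pages: dict[str, str]) -> list[str]:
    """pages maps each sitemap URL to its fetched HTML; return the noindexed ones."""
    return sorted(url for url, html in pages.items() if is_noindex(html))
```

Every URL this returns needs a decision: drop it from the sitemap or drop the noindex.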

Q: How does this differ from Bot Access Policy?

Bot Access Policy is *can the agent fetch any of your pages at all*. Indexation Coverage is *can the agent discover the specific pages you want cited*. Both fail silently and compound — a permissive robots.txt with a broken sitemap leaves agents fetching only your homepage and missing every product or article page.

An agent that can't find your page can't cite it. Indexation Coverage is the third Visibility layer — after robots.txt policy and CDN passthrough — that decides which of your URLs are reachable, canonical, and discoverable. The signal stack is well-known (XML sitemap referenced from robots.txt, canonical tags, no orphaned routes); the failure modes are subtle, and they compound across every downstream Clarity and Usability check.

By Chris Mühlnickel · 2026-05-16

What is Indexation Coverage?

Indexation Coverage is whether your important pages are discoverable through an XML sitemap referenced from robots.txt, declare canonical URLs, and aren't accidentally noindexed or orphaned from internal linking.

By the numbers

65% of mobile pages now declare a canonical URL, up from 51% in 2022, with desktop at the same rate. (HTTP Archive Web Almanac 2024, SEO chapter)
0 of the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) execute JavaScript at fetch time. (Vercel, "The rise of the AI crawler")
305% year-over-year GPTBot crawl growth between May 2024 and May 2025 on Cloudflare's network. (Cloudflare, "From Googlebot to GPTBot: who's crawling your site in 2025")

Why it matters

The Visibility sub-grade has three layers: policy (robots.txt), passthrough (CDN), and discoverability (sitemap + canonicals). The first two decide whether agents can reach any page; Indexation Coverage decides whether they can reach the right pages. AI agent crawl volumes are climbing fast (Cloudflare reports GPTBot crawl growth at 305% year-over-year through May 2025), and the volume increase makes coverage gaps more expensive, not less. An agent that fetches 100 pages from your site and misses your pricing page cites a competitor's pricing instead, every time.

Sitemaps still carry the load, especially for sites without strong inbound linking. The XML sitemap referenced from robots.txt is the cheapest discoverability primitive on the web. It tells crawlers, including every major AI agent, "here are the URLs I want indexed, with last-modified timestamps so you can prioritize freshness." The work is one-time configuration plus automated regeneration whenever content changes. Sites that ship a sitemap get the long tail crawled; sites without one are stuck on whatever the agent stumbles across through external links.
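
As a sketch of what automated regeneration might look like, the following Python builds sitemap files and a sitemap index with the standard library, splitting at the protocol's 50,000-URL limit. The function names, the `example.com` URLs, and the shape of `entries` are assumptions for illustration; it does not check the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemap protocol


def split_entries(entries, max_urls=MAX_URLS_PER_FILE):
    """Chunk (loc, lastmod) pairs so no file exceeds the URL limit."""
    return [entries[i:i + max_urls] for i in range(0, len(entries), max_urls)]


def build_sitemap(entries: list[tuple[str, str]]) -> bytes:
    """entries: (absolute URL, lastmod date such as '2026-05-16')."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_urls: list[str]) -> bytes:
    """Sitemap index pointing at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

A build step could call `split_entries`, write one file per chunk, and point the `Sitemap:` line in robots.txt at the index when there is more than one chunk.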

Canonical URLs are the deduplication signal that protects citation weight. Modern marketing stacks generate URL variants by accident — UTM parameters, mobile redirects, A/B test paths — and without a rel=canonical declaration each variant looks like a separate page to an agent. The result is fragmented citation weight: ten URLs splitting the credit one URL should carry. The Web Almanac's 65% canonical-declaration rate is a baseline; sites in the missing 35% leak agent attention on every URL parameter their stack invents.
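
To make the deduplication concrete, here is a minimal sketch of deriving one canonical URL from its variants by dropping tracking parameters and normalizing scheme and host. The parameter list, the `example.com` host, and the function names are assumptions; a real site would decide for itself which parameters change content and which do not.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)           # assumed: campaign parameters
TRACKING_PARAMS = {"gclid", "fbclid"}   # assumed: ad click identifiers


def canonical_url(url: str, canonical_host: str = "example.com") -> str:
    """Strip tracking parameters and fragments; force https and one host."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(("https", canonical_host, parts.path or "/", urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Render the tag to emit in the server-rendered <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_url("http://www.example.com/pricing?utm_source=x&plan=pro#top"))
```

Every variant of a page then declares the same `href`, so the credit lands on one URL instead of being split across ten.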

The 0-JavaScript reality of AI crawlers makes server-rendered URLs non-negotiable. Vercel's analysis of 1.3 billion AI-crawler fetches found zero JavaScript executions — every major AI crawler consumed initial HTML only. This means routes only reachable via client-side router transitions, hash-based navigation, or JavaScript-injected <link rel=canonical> tags are functionally absent from the index. SSR is upstream of Indexation Coverage; without it, your sitemap is a list of URLs that resolve to empty shells for every agent that matters.
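
One way to see what a non-rendering crawler sees is to inspect the initial HTML only, without executing any scripts. The sketch below, with hypothetical names and sample markup, looks for a canonical link in the raw response body; a tag that is only injected by JavaScript will not be found.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the first <link rel="canonical"> href present in raw HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def canonical_in_initial_html(html: str):
    """Return the canonical href as served, or None if it is absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical responses: one server-rendered, one client-side shell.
SERVER_RENDERED = '<html><head><link rel="canonical" href="https://example.com/pricing"></head><body>Pricing</body></html>'
CLIENT_SHELL = '<html><head><script>const l = document.createElement("link"); l.rel = "canonical"; document.head.appendChild(l);</script></head><body><div id="root"></div></body></html>'
```

Running this against the body returned by a plain HTTP fetch of each sitemap URL gives a quick list of pages whose canonical exists only after hydration.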

Where it's heading

Sitemap formats stretch to include agent-specific metadata. Today's XML sitemap carries URL, lastmod, changefreq, priority. Expect proposals for richer per-URL annotations — agent-mode hints, content-type signals, freshness commitments — that let agents prioritize within a sitemap rather than just walking it linearly. The current shape is solving 2010's discoverability problem, not 2026's.

llms.txt becomes the agent-native sitemap. XML sitemaps are designed for crawlers; llms.txt is designed for LLMs. Adoption is growing fast in the API-first cohort, and the convention is converging on llms.txt as the structured-prose complement to the XML sitemap — both shipped, both valuable, neither replacing the other.

Canonical management gets agent-aware tooling. Today, canonicals are a static HTML tag emitted by the CMS. Expect tooling that exposes which canonical conflicts are actively losing citations — this URL parameter generates 14 indexed variants, all without canonical tags, splitting your AI Overview citation weight across them — and the conversation flips from "is canonical configured correctly?" to "which canonical gap is costing the most revenue?"

Common mistakes

No sitemap, or a sitemap not referenced from robots.txt. The single line `Sitemap: https://yoursite.com/sitemap.xml` in robots.txt (conventionally at the bottom, though the directive is valid anywhere in the file) unlocks crawl discovery for every well-behaved bot. Leaving it out is pure foregone leverage.
Listing noindex pages in the sitemap. The contradictory signal wastes crawl budget without adding visibility — pick one position per URL and ship it.
Letting URL parameters generate uncanonicalized variants. Tracking parameters, A/B test paths, and faceted navigation all silently fork URLs, and without rel=canonical declarations each variant splits the agent citation signal.
Relying on JavaScript-driven routing for content pages. AI crawlers don't execute JS — routes only reachable through client-side router transitions are invisible no matter how good the sitemap is.
Updating the sitemap on every micro-edit. Bumping lastmod when the content hasn't meaningfully changed makes the timestamp unreliable and spends crawl budget on re-fetches that find nothing new. Regenerate on real content changes, not as a ceremony on every deploy.
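
For the last mistake in the list above, one possible approach is to tie lastmod to a hash of the rendered content so that a deploy with no content change leaves the timestamp alone. This is a sketch under that assumption; the record shape and the function name `next_lastmod` are hypothetical.

```python
import hashlib
from typing import Optional


def next_lastmod(previous: Optional[dict], body: str, today: str) -> dict:
    """Return the sitemap record for a page, bumping lastmod only on change.

    previous: the stored record from the last build, or None for a new page.
    body:     the content that should count as a "real" change.
    today:    the build date, e.g. '2026-05-16'.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if previous is not None and previous["sha256"] == digest:
        return previous  # unchanged content keeps its old lastmod
    return {"sha256": digest, "lastmod": today}
```

Hashing the main content rather than the full HTML keeps template or asset-hash changes from registering as content updates.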

Frequently asked

Do I need to ping search engines when the sitemap changes?

Less than you used to. Google deprecated its sitemap ping endpoint in 2023, and Bing had already retired anonymous sitemap pings in 2022. Modern crawlers re-read the sitemap on their own schedule, and frequent edits can actively hurt by signaling instability. Update on real content changes, not for ceremony.

What about orphan pages — pages no internal link points to?

If a page is in the sitemap but not linked from anywhere on the site, agents can still fetch it but are likely to down-weight it. Internal linking is part of the signal stack: a page worth indexing is a page worth linking to from at least one other relevant page on the site. Spekto flags orphans in the audit output.
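
Orphan detection reduces to a set difference once you have the sitemap URLs and the internal links found on each page. The sketch below is a generic illustration with hypothetical names and inputs, not a description of how Spekto's audit is implemented.

```python
def find_orphans(
    sitemap_urls: set[str],
    links: dict[str, set[str]],
    entry_points: frozenset[str] = frozenset(),
) -> list[str]:
    """Return sitemap URLs that no other internal page links to.

    links maps each crawled page URL to the internal URLs it links to.
    entry_points (e.g. the homepage) are never reported as orphans.
    """
    linked_to: set[str] = set()
    for source, targets in links.items():
        # A page linking to itself does not count as an inbound link.
        linked_to.update(target for target in targets if target != source)
    return sorted(sitemap_urls - linked_to - entry_points)
```

URLs are compared as exact strings here, so in practice both inputs would need the same normalization (trailing slashes, host, scheme) before the comparison.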
