Sub-grade

Visibility — can AI agents reach your website?

Visibility is the AAIO sub-grade covering whether AI agents can fetch your pages at all: a robots.txt that allows the right user-agents, a CDN that doesn't silently block them, a sitemap they can discover, and (increasingly) an llms.txt that maps your site for them. Most sites quietly fail one of these. The sub-grade has three scored parameters, covering bot access, CDN blocking, and indexation.

By Chris Mühlnickel · 2026-05-04

What is Visibility?

Visibility (in AAIO) is the set of access-layer signals — robots.txt policy, CDN bot policy, indexability, sitemap, llms.txt — that determine whether an AI agent can fetch your site's content at all.

By the numbers

~50% — of AI crawler traffic on Cloudflare's network in Aug 2025 came from ClaudeBot + GPTBot combined. (Cloudflare blog — AI crawler traffic by purpose and industry · 2025-08-28)
50,000:1 — Anthropic's crawl-to-refer ratio. OpenAI 887:1, Perplexity 118:1 — agents take far more than they send back. (Cloudflare blog — AI crawler traffic by purpose and industry · 2025-08-28)
~20% — the size of combined GPTBot (569M) and Claude (370M) monthly request volume relative to Googlebot's 4.5B on Vercel's network; PerplexityBot adds 24M. (Vercel — The rise of the AI crawler · 2024-12-17)

Why it matters

Access failures are silent: the agent doesn't tell you, it just doesn't cite you. Cloudflare's crawl-to-refer ratios make this brutally clear: when ChatGPT or Claude or Perplexity decides not to fetch your page, you don't see a 4xx in your logs (the request never happens). You don't see a missing citation in their output (you never know what they almost cited). You just see your competitors getting cited and don't know why. The first-mover advantage of being reachable is that you get cited while less accessible competitors stay invisible.

CDN bot-fight modes are well-intentioned and broadly correct, but often default to blocking AI agents. Cloudflare bot-fight, AWS Shield, Akamai bot manager — all designed for malicious bot mitigation, all configurable for AI agent access. The default rules are conservative, which means AI agents get blocked at the edge unless you explicitly allow them. Most site owners don't know this is happening; the symptom is the absence of citations, which is hard to attribute to a single config.

robots.txt is now a citation policy, not just a crawler policy. What you allow shows up in AI Overviews, ChatGPT Search, and Claude responses; what you block disappears from them. The reflexive 2023-era "block AI" decision was made by site owners who hadn't yet seen what allowing AI agents gives them: organic mentions, citations in answer surfaces, and zero-cost discovery from a fast-growing channel. Block decisions made today should be deliberate, narrow, and re-evaluated quarterly.

llms.txt is an emerging convention but a cheap, high-signal one. Adopters get a discoverability premium today before the convention is universal. The format is dead simple — a Markdown file at /llms.txt with a site summary, key URLs, and pointers to richer content. The cohort that ships it now (Stripe, Twilio, Linear, Notion) is exactly the cohort that wins agent-traffic share earliest. Adoption beats perfection.
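
To make the format concrete, here is a minimal sketch that renders an llms.txt body from a site summary and a few key links. The layout (an H1 title, a blockquote summary, then H2 sections of Markdown links) follows the commonly circulated llms.txt proposal as best understood here, so treat it as an assumption rather than a fixed standard. The `Link` class, `build_llms_txt`, the site name, and every URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One entry in an llms.txt link list (hypothetical helper type)."""
    title: str
    url: str
    note: str = ""


def build_llms_txt(site_name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render a minimal llms.txt body: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site and URLs, for illustration only.
example = build_llms_txt(
    "Example Co",
    "Example Co sells widgets and documents its API publicly.",
    {
        "Docs": [Link("API reference", "https://example.com/docs/api.md", "endpoints and auth")],
        "Product": [Link("Pricing", "https://example.com/pricing.md")],
    },
)
print(example)
```

The output would be saved as a static file served at /llms.txt; generating it from the same data that drives the sitemap is one way to keep the two in sync.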

SEO best practices carry forward to agents, but with sharper consequences when you fail them. XML sitemaps, canonical URLs, server-side rendering, and mobile-friendly responsive layouts each matter for human search and for agents. The difference: a JS-only critical content section that takes 4 seconds to render is merely slow for a human, who will wait or reload, but it is a hard fail for an agent that doesn't execute JavaScript at all. Pre-existing SEO debt is amplified at the agent layer.

Sub-topics

The three scored parameters in the Visibility sub-grade

V-BOT Bot Access Policy — Does your robots.txt allow the major AI agent UAs? Spekto checks both Tier A (must-allow: ChatGPT-User, Claude-User, Perplexity-User, Claude-SearchBot, OAI-SearchBot, PerplexityBot, Google-Extended) and Tier B (training crawlers: GPTBot, ClaudeBot, CCBot, Bytespider, anthropic-ai, Applebot-Extended, AmazonBot, FacebookBot, cohere-ai). Tier A has a higher weight — blocking action-time crawlers removes you from citation surfaces immediately.
V-CDN Bot Blocking Detection — Does your CDN actually let AI agents reach your origin? Spekto runs identifiable-UA probes against your edge and detects silent blocking — common Cloudflare bot-fight, AWS Shield, and Akamai bot manager defaults that catch legitimate agents.
V-IDX Indexation Coverage — Does your site have a discoverable XML sitemap referenced from robots.txt? 99% of our calibration corpus passes this — the 1% that doesn't usually has a deliberate noindex setup gone wrong.
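
The V-BOT and V-IDX checks above can be approximated locally with Python's standard `urllib.robotparser`. This is a rough sketch, not Spekto's scoring logic: the tier lists are copied from the text, while `audit_robots`, the probe URL, and the sample robots.txt are hypothetical. Note that Python's parser applies the first matching rule in file order, which can differ from crawlers that use longest-match precedence.

```python
from urllib.robotparser import RobotFileParser

# UA tokens as listed in the V-BOT description above.
TIER_A = ["ChatGPT-User", "Claude-User", "Perplexity-User", "Claude-SearchBot",
          "OAI-SearchBot", "PerplexityBot", "Google-Extended"]
TIER_B = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "anthropic-ai",
          "Applebot-Extended", "AmazonBot", "FacebookBot", "cohere-ai"]


def audit_robots(robots_txt: str, probe_url: str = "https://example.com/") -> dict:
    """Report which tiered UAs are blocked at probe_url and which sitemaps are declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "tier_a_blocked": [ua for ua in TIER_A if not parser.can_fetch(ua, probe_url)],
        "tier_b_blocked": [ua for ua in TIER_B if not parser.can_fetch(ua, probe_url)],
        # site_maps() returns None when no Sitemap: line is present.
        "sitemaps": parser.site_maps() or [],
    }


# Hypothetical robots.txt: blocks one training crawler, allows everything else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

report = audit_robots(SAMPLE)
print(report)
```

An empty `sitemaps` list corresponds to the V-IDX failure case: a sitemap may exist, but nothing in robots.txt points agents to it.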

Where it's heading

Citation policy as a deliberate layer. The next iteration of robots.txt is per-bot, per-section allowlisting — "allow ChatGPT-User on /products/, allow GPTBot on /blog/, block both on /admin/." Cloudflare's AI Audit and similar tools are moving toward making this a first-class configuration surface rather than a robots.txt edit. Sites that treat citation policy as a deliberate product surface will get cleaner agent-routing than sites that treat robots.txt as a "set once, forget" file.
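
A per-bot, per-section policy like the one quoted above can already be expressed in plain robots.txt and checked locally. The sketch below assumes that "allowlisting" means each named bot is denied outside its allowed section; that reading, the example.com URLs, and the `policy_matrix` helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: ChatGPT-User may read /products/, GPTBot may read /blog/,
# both are denied elsewhere, and every UA is kept out of /admin/.
POLICY = """\
User-agent: ChatGPT-User
Disallow: /admin/
Allow: /products/
Disallow: /

User-agent: GPTBot
Disallow: /admin/
Allow: /blog/
Disallow: /

User-agent: *
Disallow: /admin/
"""


def policy_matrix(robots_txt: str, user_agents: list[str], paths: list[str]) -> dict:
    """Map (user_agent, path) to True when the policy allows the fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (ua, path): parser.can_fetch(ua, f"https://example.com{path}")
        for ua in user_agents
        for path in paths
    }


matrix = policy_matrix(
    POLICY,
    ["ChatGPT-User", "GPTBot", "Bingbot"],
    ["/products/widget", "/blog/post", "/admin/"],
)
for (ua, path), ok in matrix.items():
    print(f"{ua:13} {path:18} {'allow' if ok else 'block'}")
```

The rules are ordered so that first-match parsers (such as Python's) and longest-match parsers should reach the same decision for these paths.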

llms.txt approaching convention status. Adoption is small (17% in our corpus today) but the cohort is the right cohort — API-first and product-led SaaS leading. Expect adoption to cross 50% in mature B2B verticals by mid-2027.

Agent-specific user-agent declarations evolving. Today, "AI agent" is a coarse category covering training crawlers, search-time fetchers, browse-on-behalf-of-user agents, and computer-use agents — all distinct in purpose. Robots.txt and CDN tooling will get more granular, allowing per-purpose policy rather than per-UA-string. The framework will follow.

Common mistakes

`User-agent: *` / `Allow: /` with overly aggressive Cloudflare rules canceling it. robots.txt is permissive; the CDN above isn't. The site owner thinks they're agent-friendly; they're not. Our CDN-blocking check catches this, but most owners don't run the test.
Blocking GPTBot reflexively because it's "AI." Specific opt-outs make sense for some publishers; blanket blocks usually don't.
Not having a sitemap. Or having one that's not referenced from robots.txt. 99% of sites pass this; the 1% that don't are surprising.
SSR-skipping critical content. Hero text rendered after JS hydration. Pricing tables that load via API call after first paint. Reviews on a JS-only widget. All invisible to AI crawlers (the major training crawlers don't execute JavaScript).
Enumerating allowed UAs in robots.txt but missing recent additions. ChatGPT-User, Claude-User, and Perplexity-User were introduced in 2024; sites that wrote their robots.txt in 2022 are missing them by default unless they used a wildcard, which matters whenever unlisted UAs fall through to a restrictive default.
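
The SSR-skipping mistake above is easy to test for: check whether the phrases that matter appear in the raw HTML response, before any JavaScript runs. The sketch below uses only the standard library; `missing_from_raw_html`, the sample markup, and the phrases are hypothetical, and it only approximates what a non-rendering crawler sees.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(raw_html: str, critical_phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the un-rendered HTML text."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = re.sub(r"\s+", " ", " ".join(extractor.chunks)).lower()
    return [phrase for phrase in critical_phrases if phrase.lower() not in text]


# Hypothetical page: the heading is server-rendered, the pricing loads via an API call.
RAW = """<html><body>
<h1>Acme Widgets</h1>
<div id="pricing"></div>
<script>fetch('/api/pricing').then(r => r.json()).then(render)</script>
</body></html>"""

print(missing_from_raw_html(RAW, ["Acme Widgets", "$49 per month"]))
```

Any phrase reported as missing is content that a crawler without JavaScript execution would likely never see.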

Frequently asked

Should I block GPTBot?

Almost never. Blocking GPTBot removes you from ChatGPT's training and citation surfaces with no upside for most sites. The reflexive 'block AI to protect content' move costs you citation surfaces (ChatGPT, ChatGPT Search, and downstream tools that route through OpenAI) for no measurable gain. Specific exceptions: paywalled-content publishers with strict licensing constraints, or sites with strong contractual reasons to opt out. Even those should usually allow ChatGPT-User (browsing on behalf of a user) while blocking GPTBot (training).

What's the difference between blocking in robots.txt and blocking at the CDN?

robots.txt is a request — well-behaved crawlers honor it; misbehaving ones don't. CDN-level blocking is a hard wall — the request never reaches your origin server. They serve different purposes. The common failure mode is unintentional CDN-level blocking: Cloudflare's bot-fight mode, AWS Shield, or Akamai's bot manager apply default rules that catch legitimate AI agents. Spekto's CDN-blocking check probes for this — 25% of our calibration corpus fails it despite a permissive robots.txt.
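
One rough way to look for edge-level blocking yourself is to request the same URL with a browser-like User-Agent and with an agent-like one, then compare status codes. This is a sketch under stated assumptions, not Spekto's probe: the UA strings are illustrative placeholders rather than the vendors' exact strings, `fetch_status` and `classify` are hypothetical helpers, and the status-code heuristic is a guess at common blocking responses. CDNs that verify bots by IP range may treat a spoofed UA differently from the real crawler, so results are only indicative.

```python
import urllib.error
import urllib.request

# Illustrative placeholder UA strings (assumptions, not official values).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
AGENT_UAS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
}


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code seen when requesting url with the given UA."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def classify(baseline_status: int, agent_status: int) -> str:
    """Compare a browser-like baseline against an agent-like probe."""
    if baseline_status >= 400:
        return "inconclusive"  # the page fails even for a browser-like UA
    if agent_status in (401, 403, 429, 503):
        return "likely-blocked"  # typical edge refusal or challenge codes
    if agent_status >= 400:
        return "error"
    return "reachable"


# Usage against a site you control (performs real network requests):
# baseline = fetch_status("https://example.com/", BROWSER_UA)
# for name, ua in AGENT_UAS.items():
#     print(name, classify(baseline, fetch_status("https://example.com/", ua)))
```

A "reachable" result is not proof of access: some challenge pages return 200 with interstitial content, so comparing response bodies as well would make the check stricter.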

Does Cloudflare's bot-fight mode block AI agents?

Yes, by default — and that's the most common cause of silent CDN-level blocking. The default behavior is conservative: it blocks anything that looks bot-shaped, including legitimate AI crawlers. The fix is per-zone configuration: explicitly allowlist the major AI agent UAs (GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User, Claude-User, etc.) at the WAF / firewall rule layer, or disable bot-fight mode on the paths you want indexed. Cloudflare publishes guidance on this; the catch is most site owners don't know they need to apply it.

Is llms.txt actually used by anyone?

Yes, and adoption is growing fast. The cohort matters: API-first SaaS, developer tools, and content-rich sites are early adopters. Anthropic's Claude, OpenAI's ChatGPT browsing, Perplexity, and several IDE-integrated agents read llms.txt at site root. The cost of writing one is low (a Markdown file with site description + key page links); the upside is being legibly mapped for agents that prefer structured discovery to crawl-everything fallback. Even in Spekto's calibration corpus, the 17% adoption is up sharply from <5% in early 2025.

What's the right robots.txt for an agent-friendly site?

Allow the standard search bots (Googlebot, Bingbot), allow major AI agent UAs (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, ChatGPT-User, Claude-User, Perplexity-User), allow social-preview UAs (FacebookExternalHit, Twitterbot, LinkedInBot), and allow all unspecified UAs by default. Block paths that genuinely shouldn't be indexed (admin areas, search-result pages with thin content). The detail matters: explicit `User-agent: GPTBot` blocks shouldn't appear unless you specifically intend to opt out of OpenAI's training and citation surfaces.
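
One simple way to get this behavior is a single wildcard group that allows everything except a short list of paths, plus a sitemap reference. The sketch below generates such a file and spot-checks it with `urllib.robotparser`. The `build_robots_txt` helper, the paths, and the sitemap URL are hypothetical, and using one wildcard group instead of per-UA groups is an assumption about what fits most sites.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths: list[str], sitemap_url: str) -> str:
    """Render a permissive robots.txt: one wildcard group, a few blocked paths."""
    lines = ["User-agent: *"]
    # Disallow rules come first so first-match parsers also honor them.
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    lines += ["Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical paths and sitemap location.
robots = build_robots_txt(["/admin/", "/search/"], "https://example.com/sitemap.xml")
print(robots)

parser = RobotFileParser()
parser.parse(robots.splitlines())
for ua in ["Googlebot", "GPTBot", "ClaudeBot", "ChatGPT-User", "Twitterbot"]:
    public = parser.can_fetch(ua, "https://example.com/pricing")
    admin = parser.can_fetch(ua, "https://example.com/admin/users")
    print(f"{ua:13} public={public} admin={admin}")
```

Because no UA is named explicitly, agents introduced after the file was written inherit the same permissive default.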

Do AI Overviews respect `noindex`?

Yes. Pages with `noindex` in `<meta name='robots'>` or as an HTTP `X-Robots-Tag` header are excluded from Google's index, which means they're excluded from AI Overviews. The implication: if you want a page cited by AI Overviews, it must be indexable. The corollary: pages you explicitly noindex (login, search-result, gated content) are not citation candidates. Most sites accidentally noindex more than they realize.
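
To find pages that are noindexed by accident, both signals can be checked on a fetched response. The following is a minimal sketch, assuming you already have the HTML body and response headers in hand; `is_noindexed` is a hypothetical helper, and it ignores bot-specific forms such as a user-agent prefix inside `X-Robots-Tag`.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def is_noindexed(html: str, headers: dict[str, str]) -> bool:
    """True when the meta robots tag or the X-Robots-Tag header carries noindex."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    header_value = next(
        (value for key, value in headers.items() if key.lower() == "x-robots-tag"), ""
    )
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    # "none" is shorthand for "noindex, nofollow".
    blocking = {"noindex", "none"}
    return bool(blocking & set(parser.directives + header_directives))


print(is_noindexed('<meta name="robots" content="noindex, follow">', {}))
print(is_noindexed("<p>Indexable page</p>", {"X-Robots-Tag": "noindex"}))
print(is_noindexed("<p>Indexable page</p>", {}))
```

Running a check like this over every URL in the sitemap is one way to surface pages that are listed for discovery but excluded from the index.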

How can I see what GPTBot, ClaudeBot, etc. actually fetched?

Server logs filtered by user-agent string are the canonical source — search for 'GPTBot/', 'ClaudeBot/', 'PerplexityBot/' patterns. Cloudflare, Fastly, and Akamai expose this in their analytics dashboards. Vercel publishes per-bot request volumes in their Web Analytics. The exercise is worth doing: most site owners are surprised by either how much or how little AI crawler traffic they actually receive.
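
The log filtering described above can be sketched in a few lines. This example assumes access logs in the common "combined" format, where the request line and the user-agent appear in double quotes; the sample lines, IP addresses, and the `paths_by_bot` helper are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# The bot tokens named above, matched as "Name/" inside the user-agent field.
BOT_TOKEN = re.compile(r"\b(GPTBot|ClaudeBot|PerplexityBot)/")
# Request line such as "GET /pricing HTTP/1.1" (combined log format assumed).
REQUEST_PATH = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def paths_by_bot(log_lines):
    """Count fetched paths per AI crawler from access-log lines."""
    fetched = defaultdict(Counter)
    for line in log_lines:
        bot = BOT_TOKEN.search(line)
        request = REQUEST_PATH.search(line)
        if bot and request:
            fetched[bot.group(1)][request.group(1)] += 1
    return fetched


# Made-up sample lines using documentation-range IP addresses.
SAMPLE_LOG = [
    '203.0.113.7 - - [04/May/2026:10:00:01 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [04/May/2026:10:05:09 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [04/May/2026:10:06:30 +0000] "GET /docs/api HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.15 - - [04/May/2026:10:07:12 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0"',
]

result = paths_by_bot(SAMPLE_LOG)
for bot, paths in result.items():
    print(bot, dict(paths))
```

User-agent strings can be spoofed, so counts from logs alone are an upper bound; cross-checking source IPs against the ranges each vendor publishes gives a more reliable picture.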