Bot Access Policy — does your robots.txt let AI agents in?
Your robots.txt is the policy layer that tells crawlers — including every major AI agent — whether they're allowed to fetch your pages. It looks like a small text file, but its actual job in 2026 is bigger: it decides whether ChatGPT, Claude, Perplexity, and the AI Overview pipeline see your site at all. Most sites fail this check accidentally — by blocking AI bots reflexively or by missing newer agent user-agents introduced in 2024-2025.
By Chris Mühlnickel · 2026-05-14
What is Bot Access Policy?
Bot Access Policy is whether your robots.txt explicitly permits the major AI-agent user-agents — both action-time agents (ChatGPT-User, Claude-User, Perplexity-User, etc.) and training crawlers (GPTBot, ClaudeBot, Google-Extended, etc.) — and is well-formed enough for those agents to interpret correctly.
By the numbers
- ~50% — of AI crawler traffic on Cloudflare in Aug 2025 came from ClaudeBot + GPTBot. (Cloudflare blog — AI crawler traffic by purpose and industry)
- 50,000:1 — Anthropic's crawl-to-refer ratio. OpenAI 887:1, Perplexity 118:1 — agents take far more than return. (Cloudflare blog — AI crawler traffic by purpose and industry)
- 25% — of top 1,000 sites block GPTBot today, up from 5% in early 2023. (Originality.AI — AI bot blocking tracker)
Why it matters
Robots.txt is the first thing agents check, and the last thing site owners think about. When an AI agent considers your site as a citation source or action target, the robots.txt fetch happens before any other request. Block the wrong UA — even by accident — and you're invisible to that platform's entire downstream surface. ChatGPT Search, AI Overviews, Perplexity, Claude.ai citations: all of these route through user-agent-keyed crawl policies that you set in this file. The cost of one stale Disallow: / rule is being absent from a citation surface used by hundreds of millions of users.
The reflexive "block AI" move costs more than it saves. A common 2023-2024 pattern was sites adding `User-agent: GPTBot` + `Disallow: /` as a defensive measure against AI scraping. Two years on, the picture is clearer: blocking GPTBot removes you from ChatGPT's training and citation surface. Blocking ClaudeBot removes you from Claude's training and citation surface. The training-vs-citation distinction matters — most platforms honor separate opt-outs for each, but blocking the training UA generally also drops you from the citation index that the training feeds. The right pattern for almost every public site is *allow major AI bots, control specific paths*. The exceptions (paywalled content, contractual restrictions) need explicit per-UA reasoning, not blanket blocks.
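A minimal sketch of that pattern, with placeholder paths and a hypothetical domain (verify each UA token against the vendor's current documentation before relying on it):

```
# Permissive default: every crawler, including AI agents without a group
# of their own, inherits these rules.
User-agent: *
Disallow: /admin/
Disallow: /search

Sitemap: https://www.example.com/sitemap.xml
```

Because GPTBot, ClaudeBot, and the action-time UAs have no group of their own here, they fall back to the `*` group and can fetch everything except the disallowed paths.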
Newer agent UAs are the silent failure mode. ChatGPT-User, Claude-User, and Perplexity-User were introduced in 2024 to represent agent browsing on behalf of a user — distinct from training crawlers. Sites that wrote their robots.txt in 2022 or earlier are missing these by default unless they use a permissive `User-agent: *` wildcard. Worse, the reverse pattern (blocking with `User-agent: *` to opt out of all bots) silently catches the legitimate browsing agents that real users have asked to fetch the site for them — turning an "I don't want to be scraped" decision into a "users can't get answers about me" outcome.
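If the real goal is "no training, but users can still ask agents about us," the narrower expression is per-UA. A sketch, using UA tokens as documented by each vendor in 2024-2025:

```
# Opt out of training crawls only
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Keep user-initiated (action-time) fetches open
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
Allow: /

# Everything else: default permissive
User-agent: *
Disallow:
```

As the previous paragraph notes, blocking the training UAs still carries a citation cost — this split is for sites that have made that trade-off deliberately.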
The robots.txt → CDN gap is where most failures actually happen. Even a permissive robots.txt can be neutralized by an over-aggressive CDN bot policy (see also: Bot Blocking Detection). But the robots.txt step is the first measurable signal — and it's the one site owners actually control without a vendor escalation. Getting Bot Access Policy right is the cheapest, fastest, most reversible Visibility fix on the framework, which is why it sits at the entry point of the Visibility sub-grade.
Where it's heading
The action-time vs. training-time distinction will harden. Today the difference between agent fetching on behalf of a user (action-time) and agent fetching to build training data (training-time) is a soft convention. By 2027 expect standardized UA categories — possibly via the `agents.json` proposal or a robots.txt extension — that let sites opt into one without the other, with clear consequences signaled by each platform.
Robots.txt extensions for agent-specific signals. The current robots.txt syntax can't express "this URL is safe for an agent to fetch but not safe to cite in a public answer" or "this URL is safe to index but please retry with backoff." Expect proposals for either inline directives or a sibling file (think `agents.txt` or `policy.json`) that captures these richer policies.
Platform-level "Allow AI bots" toggles become the default UX. Cloudflare's "Block AI bots" toggle launched in 2024 as a one-click opt-out; the inverse — "Allow AI bots" — is becoming the more important toggle as agent traffic monetizes. Expect platform-level UX in Cloudflare, AWS, Akamai, and Vercel that surfaces the AI-bot policy as a first-class config setting rather than a robots.txt edit.
Common mistakes
- Blocking GPTBot reflexively because it's 'AI.' Specific opt-outs make sense for some publishers; blanket blocks usually don't. The cost is removal from ChatGPT's citation surface for content that would otherwise rank.
- Allowing every UA in robots.txt but missing recent additions. ChatGPT-User, Claude-User, and Perplexity-User were introduced in 2024. Sites with 2022-vintage robots.txt files miss them by default unless they use a wildcard.
- Using `User-agent: *` + `Disallow: /` as a placeholder. This blocks everyone. Often inherited from staging configs that shipped to production without anyone reading the file (see the staging snippet after this list).
- Trusting robots.txt to protect security paths. `Disallow: /admin/` is a routing hint, not access control. Sensitive paths need real auth at the application layer; robots.txt just tells well-behaved crawlers to skip them.
- Forgetting to reference the sitemap. `Sitemap: https://yoursite.com/sitemap.xml` at the bottom is one line and unlocks crawl discovery for every well-behaved bot. Missing it is pure foregone leverage.
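For the placeholder mistake, the safer habit is to make the intent explicit and keep the file out of production deploys — a sketch:

```
# staging.example.com only — do NOT ship this file to production
User-agent: *
Disallow: /
```

Even on staging, HTTP auth or a `noindex` header is the more robust guard, since robots.txt is only advisory.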
Frequently asked
Should I block GPTBot?
Almost never. Blocking GPTBot removes you from ChatGPT's training AND citation surfaces — the reflexive 'block AI to protect content' move costs you visibility across one of the largest LLM ecosystems for no measurable gain. Specific exceptions: paywalled-content publishers with strict licensing constraints, or sites with strong contractual reasons to opt out. Even those should usually allow ChatGPT-User (browsing on behalf of a user) while blocking GPTBot (training).
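For those exception cases, the split can be expressed per UA. A sketch using OpenAI's documented tokens (verify current names before relying on them):

```
# Licensing-constrained publisher: no training crawls,
# but user-initiated browsing stays allowed
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```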
What's the difference between blocking in robots.txt and blocking at the CDN?
robots.txt is a request — well-behaved crawlers honor it; misbehaving ones don't. CDN-level blocking is a hard wall — the request never reaches your origin server. They serve different purposes. The common failure mode is unintentional CDN-level blocking: Cloudflare's bot-fight mode, AWS Shield, or Akamai's bot manager apply default rules that catch legitimate AI agents. See Bot Blocking Detection.
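A rough way to check the CDN layer from outside, sketched in Python with simplified user-agent strings (real crawlers send fuller, vendor-documented strings, and CDNs that verify bots by IP range may treat spoofed requests differently — treat a failure here as a signal, not proof):

```python
import requests  # third-party: pip install requests

URL = "https://www.example.com/"  # hypothetical target site

# Simplified UA strings carrying the bot tokens; check vendor docs
# for the exact values each crawler sends today.
PROBES = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "baseline browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

for name, ua in PROBES.items():
    try:
        resp = requests.get(URL, headers={"User-Agent": ua}, timeout=10)
        flag = "ok" if resp.status_code == 200 else "possible edge-layer block"
        print(f"{name:18} HTTP {resp.status_code}  {flag}")
    except requests.RequestException as exc:
        print(f"{name:18} request failed: {exc}")
```

If the bot UAs get 403s or 503s while the baseline browser UA gets a 200, the block is upstream of robots.txt.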
What's the right robots.txt for an agent-friendly site?
Allow the standard search bots (Googlebot, Bingbot), allow major AI agent UAs (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, ChatGPT-User, Claude-User, Perplexity-User), allow social-preview UAs (FacebookExternalHit, Twitterbot, LinkedInBot), and allow all unspecified UAs by default. Block paths that genuinely shouldn't be indexed (admin areas, search-result pages with thin content). Reference your sitemap at the bottom.
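Assembled into one file, that policy might look like the sketch below. Paths are hypothetical, and the UA tokens should be checked against each vendor's documentation before use:

```
# Search, AI-agent, and social-preview bots: explicit groups make the
# policy auditable, even though they match the wildcard default below.
User-agent: Googlebot
User-agent: Bingbot
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: FacebookExternalHit
User-agent: Twitterbot
User-agent: LinkedInBot
Allow: /
Disallow: /admin/
Disallow: /search

# Everyone else
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /search

Sitemap: https://www.example.com/sitemap.xml
```

Grouping multiple `User-agent` lines over one rule set is valid under RFC 9309; the explicit list buys nothing functionally over the wildcard here, but it documents intent and survives a later tightening of the `*` group.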
Do AI Overviews respect `noindex`?
Yes. Pages carrying `noindex` in a `<meta name="robots">` tag or in an `X-Robots-Tag` HTTP header are excluded from Google's index, which means they're excluded from AI Overviews. The implication: if you want a page cited by AI Overviews, it must be indexable. The corollary: pages you explicitly noindex (login, search-result, gated content) are not citation candidates.
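The two equivalent ways to express that exclusion, for reference:

```
<!-- In the page's <head> -->
<meta name="robots" content="noindex">

# Or as a response header (useful for PDFs and other non-HTML resources)
X-Robots-Tag: noindex
```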
Should I list every AI bot UA explicitly, or just use `User-agent: *`?
Wildcard is fine for the simple case ('everyone can access everything except admin paths'). Explicit per-UA groups are only worth it when you want different policies per agent platform — e.g. allow action-time UAs but block training UAs. For most sites, wildcard + path-specific disallows is the cleanest.
How can I see what GPTBot, ClaudeBot, etc. actually fetched?
Server logs filtered by user-agent string are the canonical source — search for 'GPTBot/', 'ClaudeBot/', 'PerplexityBot/' patterns. Cloudflare, Fastly, and Akamai expose this in their analytics dashboards. Vercel publishes per-bot request volumes in their Web Analytics. The exercise is worth doing: most site owners are surprised by either how much or how little AI crawler traffic they actually receive.
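A quick way to get those counts from a raw access log, sketched in Python. The log path and token list are assumptions — adjust for your stack. (Note that Google-Extended never shows up in logs: it's a robots.txt token, not a crawler; Googlebot does the fetching.)

```python
# Count requests per AI crawler in a combined-format access log.
# Substring matching on the UA token is crude but matches how these
# bots identify themselves in practice.
from collections import Counter
from pathlib import Path

LOG_PATH = Path("/var/log/nginx/access.log")  # hypothetical location
BOT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
              "PerplexityBot", "Perplexity-User"]

hits = Counter()
with LOG_PATH.open(errors="replace") as log:
    for line in log:
        for token in BOT_TOKENS:
            if token in line:
                hits[token] += 1
                break  # count each request line once

for token, count in hits.most_common():
    print(f"{token:16} {count}")
```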
My robots.txt looks right but the bots still aren't reaching us. Why?
Almost always a CDN-layer issue. Cloudflare bot-fight mode (default-on in many setups), AWS Shield, Akamai bot manager, and aggressive WAF rules can block legitimate bots before they reach robots.txt. Run a Spekto audit to identify silent edge-layer blocks — see Bot Blocking Detection.