
Content Extractability — can agents read what you actually publish?

The major AI agent crawlers don't execute JavaScript. If your pricing, hero copy, product names, or critical content depends on JS to render, agents see an empty shell. The fix is server-rendering the parts that matter — same content humans see, available in initial HTML. Sites that get this right pull ahead on every downstream signal; sites that don't are invisible regardless of how good the rest is.

By Chris Mühlnickel · 2026-05-16

What is Content Extractability?

Content Extractability is whether your page's load-bearing content (text, prices, product names, hero copy, structured data) is present in server-rendered HTML — readable by agents that don't execute JavaScript.

Why it matters

Most agents don't execute JavaScript at fetch time. GPTBot, ClaudeBot, and PerplexityBot all crawl HTML and move on; they don't run your React app. Vercel's analysis of 1.3 billion AI-crawler fetches across its network found zero JavaScript executions — every major AI crawler consumed initial HTML only. The content you see in the browser after hydration is invisible to them unless it was also in the initial HTML response. Content Extractability is the prerequisite for every downstream Clarity check: Schema.org doesn't help if the schema is JS-injected, pricing doesn't help if the price loads in a `useEffect`, reviews don't help if the widget hasn't hydrated yet.

The "but Google indexes JS" defense doesn't apply to AI agents. Googlebot does execute JavaScript — eventually, in a second-pass render that can lag the initial crawl by days. ChatGPT, Claude, and Perplexity don't. They take the initial HTML and move on. If your content depends on a JS framework to mount, you're optimizing for Google's second-pass while becoming invisible to the entire LLM ecosystem. The bet that "Google handles it" was reasonable in 2020; it ages badly as agent traffic crosses the line where it matters more than the second-pass SERP delta.

[CSR](/learn/glossary#term-csr)-only frameworks are the most common Clarity failure. Pure client-side React, Vue, or Angular apps that fetch data on mount produce empty initial HTML. The browser sees the right thing eventually; agents see `<div id="root"></div>`. Frameworks like Next.js, Nuxt, and Remix exist specifically to fix this — SSR is the right answer for any agent-relevant content. Even Claude's crawler, which fetches JavaScript files 23.84% of the time, never executes them — the fetch is for completeness, the value is still extracted from the HTML alone.
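
To make the failure mode concrete, here is a minimal sketch of a client-only React 18 entry point. The `/api/plans` endpoint and the plan data are hypothetical; the point is that everything an agent can extract is the empty root div in the static HTML, because the fetch and render only happen after the bundle executes in a browser.

```tsx
// index.html shipped by a pure CSR build is all a non-JS crawler ever sees:
//   <body><div id="root"></div><script src="/bundle.js"></script></body>
import { useEffect, useState } from "react";
import { createRoot } from "react-dom/client";

function App() {
  const [plans, setPlans] = useState<string[]>([]);

  // Runs only in the browser, after the JS bundle loads. An agent that
  // reads the initial HTML and moves on never sees the result.
  useEffect(() => {
    fetch("/api/plans") // hypothetical endpoint
      .then((res) => res.json())
      .then(setPlans);
  }, []);

  return (
    <ul>
      {plans.map((plan) => (
        <li key={plan}>{plan}</li>
      ))}
    </ul>
  );
}

// The plan list exists only after this runs client-side.
createRoot(document.getElementById("root")!).render(<App />);
```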

Pricing is the highest-stakes case. A pricing page that only says "request a demo", or a JS-rendered pricing widget, is invisible to agents trying to compare options for users. Sites with machine-readable, server-rendered pricing are the ones that get cited in agent shopping queries; everyone else is routed around. The same pattern applies to product names, hero copy, and review widgets: anything the user reads, the agent must also read in the initial HTML response.
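
A sketch of the server-rendered alternative, assuming Next.js App Router conventions; the `api.example.com` endpoint and the `Plan` shape are placeholders. Because this is a server component, the plan names and prices are part of the initial HTML response rather than something a browser assembles later.

```tsx
// app/pricing/page.tsx -- a React Server Component: the rendered plan
// names and prices land in the HTML an agent fetches.
type Plan = { name: string; pricePerMonth: number };

async function getPlans(): Promise<Plan[]> {
  // Hypothetical internal endpoint; this fetch runs on the server.
  const res = await fetch("https://api.example.com/plans", { cache: "no-store" });
  return res.json();
}

export default async function PricingPage() {
  const plans = await getPlans();
  return (
    <ul>
      {plans.map((plan) => (
        <li key={plan.name}>
          {plan.name}: ${plan.pricePerMonth}/mo
        </li>
      ))}
    </ul>
  );
}
```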

Where it's heading

Default to SSR-first patterns. Framework defaults are shifting — Next.js App Router defaults to React Server Components, SvelteKit defaults to SSR, Remix has always been SSR-first. Pure client-side rendering is increasingly the legacy choice, not the default. New projects in 2026 inherit agent-readability for free if they accept their framework's defaults.

Agent-aware ISR. Incremental Static Regeneration with agent-specific revalidation triggers — regenerate price pages every 5 minutes when AI traffic share exceeds X% — is an emerging pattern for high-stakes commerce pages. The signal is the same as the SEO freshness pattern, retuned for the agent retrieval interval.
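
A minimal sketch of the time-based half of that pattern, assuming Next.js App Router; the catalog endpoint and product shape are hypothetical. The traffic-share trigger described above would be layered on top, for example via on-demand revalidation (`revalidatePath`) fired from whatever system watches agent traffic.

```tsx
// app/products/[slug]/page.tsx -- ISR: statically generated, re-rendered
// on the server at most every 5 minutes, so agents get current prices in
// the initial HTML without a per-request render.
export const revalidate = 300; // seconds

export default async function ProductPage({
  params,
}: {
  params: { slug: string };
}) {
  // Hypothetical catalog endpoint; the fetch happens server-side.
  const product = await fetch(
    `https://api.example.com/products/${params.slug}`
  ).then((res) => res.json());

  return (
    <article>
      <h1>{product.name}</h1>
      <p>${product.price}</p>
    </article>
  );
}
```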

Edge-rendering becomes standard. Cloudflare Workers, Vercel Edge, Fastly Compute@Edge all let SSR happen at the CDN layer rather than at a regional origin. Faster, cheaper, and still agent-readable. The performance gap between SSR and client-rendering has narrowed enough that the historical "CSR for speed" argument no longer holds.
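
For illustration, a bare-bones Cloudflare Worker that returns fully formed HTML from the edge; the price lookup is a placeholder where real code would hit a KV store or upstream API. The same shape works on the other edge runtimes the paragraph mentions.

```ts
// worker.ts -- edge SSR sketch: the response body is complete HTML,
// so it is agent-readable without any client-side JavaScript.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const price = "$49/mo"; // placeholder for an edge-cached lookup

    const html = `<!doctype html>
<html>
  <body>
    <h1>Pro plan</h1>
    <p>Price: ${price}</p>
    <p>Path: ${url.pathname}</p>
  </body>
</html>`;

    return new Response(html, {
      headers: { "content-type": "text/html; charset=utf-8" },
    });
  },
};
```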

Common mistakes

  • Trusting Google's JS-indexing as a proxy for agent visibility. Google's second-pass render doesn't help GPTBot or ClaudeBot — different agents, different behaviors, different visibility outcomes.
  • Server-rendering the shell but JS-loading the content. If only the header and footer are server-rendered and everything else is client-side, agents see the shell and miss the value. The hard part is the load-bearing content, not the chrome.
  • Using `useEffect` to fetch critical data. It runs client-side only; agents bail before it fires. Move to server-side data fetching (`getServerSideProps`, server components, Remix loaders) for anything that must be cited (see the sketch after this list).
  • Letting hydration mismatches accumulate. Each hydration mismatch is a divergence between what agents see and what users see — and they tend to compound silently until citations drop.
  • Putting [Schema.org](/learn/clarity/schema-coverage) in a JS-injected `<script>` block. Even valid Schema is invisible if it's not in initial HTML. Schema and SSR are coupled — one without the other is worth less than the sum suggests.
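
To pair with the `useEffect` failure shown earlier, here is the same fetch moved server-side using the Pages Router `getServerSideProps` API; the endpoint and plan fields are hypothetical. The plan list is rendered on the server and arrives in the initial HTML.

```tsx
// pages/pricing.tsx -- the data fetch runs on the server, so the rendered
// plan list is in the HTML response an agent receives.
import type { GetServerSideProps } from "next";

type Props = { plans: { name: string; price: string }[] };

export const getServerSideProps: GetServerSideProps<Props> = async () => {
  // Hypothetical endpoint; this never runs in the browser.
  const res = await fetch("https://api.example.com/plans");
  const plans = await res.json();
  return { props: { plans } };
};

export default function Pricing({ plans }: Props) {
  return (
    <ul>
      {plans.map((plan) => (
        <li key={plan.name}>
          {plan.name}: {plan.price}
        </li>
      ))}
    </ul>
  );
}
```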

Frequently asked

Does Google index JavaScript-rendered content?

Eventually, in a second-pass render that can lag the initial crawl by days or weeks. But AI agents — GPTBot, ClaudeBot, PerplexityBot — generally don't execute JS at all. Optimizing for Google's second-pass while losing AI visibility is a poor trade, especially as agent traffic monetizes faster than incremental SERP rank.

Do I need SSR for every page?

For content pages — product, pricing, marketing, blog, docs — yes. For authenticated app pages behind login: no, agents shouldn't see those anyway. Scope SSR to the public surface that needs to be cited or extracted; the gated surface can stay client-rendered without cost.

SSR vs. SSG — which?

Static generation (SSG) when content doesn't change per-request (blog posts, marketing pages). SSR when content changes per-request, though logged-in views shouldn't be agent-visible anyway. For e-commerce product pages with frequently-updated prices, use ISR (build plus revalidate on schedule) — same agent-readable surface, lower runtime cost.

My framework supports SSR but we're not using it. What's the migration cost?

Varies. Converting a client-fetching Next.js page to `getServerSideProps`: hours to days per route. Create React App to Next.js: weeks. The migration usually pays for itself in extractability alone, separate from the Schema Coverage benefits that ride along. Run a Spekto audit to scope which routes are losing citations today so the migration sequence is data-informed.

How do I test what agents actually see?

`curl -A "GPTBot/1.0" https://yoursite.com/page-url/` is the canonical test. View the raw HTML response. If your critical content isn't there, agents don't see it. Browser DevTools with JavaScript disabled also works for a quick visual check, but the curl approach matches what agents themselves do.
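
If you want to automate that check, a small script along these lines fetches the page with a crawler-style user agent and greps the raw HTML for load-bearing strings. The URL, user-agent string, and the marker strings are placeholders; it assumes a modern Node runtime with ESM and top-level await.

```ts
// check-extractability.ts -- fetch a page the way a non-JS crawler does
// and verify that critical content appears in the raw HTML.
const url = "https://yoursite.com/pricing/"; // placeholder
const mustContain = ["$49", "Pro plan"];      // placeholder markers

const res = await fetch(url, {
  headers: { "User-Agent": "GPTBot/1.0" },
});
const html = await res.text();

for (const needle of mustContain) {
  const found = html.includes(needle);
  console.log(`${needle}: ${found ? "present" : "MISSING from initial HTML"}`);
}
```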

What about hybrid client-server frameworks like Astro or Qwik?

They're agent-friendly by design — both ship server-rendered HTML by default and hydrate selectively client-side. Strong choices for content-heavy sites that don't need a full SPA runtime. The default behaviour gets Content Extractability right without specific configuration.

Does this apply to single-page apps that have already migrated to React Server Components?

Mostly yes — RSC ships server-rendered HTML, which is what agents want. The remaining failure modes are RSC components that defer to client components for load-bearing content (pricing widgets, dynamic feature comparisons). Spot-check those routes with the curl test above.