Frontier hub

Agent Intelligence — retry-safety and CUA targetability

Agent Intelligence is the AAIO Frontier category covering whether your site lets agents retry safely, exposes stable click targets, and behaves predictably under non-deterministic interaction. The two biggest signals: idempotency (Retry Safety) and stable, semantic UI structure (CUA Targetability). Both compound on existing accessibility and engineering investments.

By Chris Mühlnickel · 2026-05-04

What is Agent Intelligence?

The Frontier category covering retry-safety, action-layer determinism, and semantic UI signals that let AI agents reliably navigate, retry, and reason about your site's state.

By the numbers

22.0%: Claude 3.5 Sonnet's computer-use score on the OSWorld benchmark when allowed more steps per task (October 2024). In the screenshot-only category it scored 14.9%, roughly 2× the next-best AI system's 7.8% (Anthropic, Introducing computer use)
24h: minimum time Stripe retains the saved result for an Idempotency-Key on POST endpoints. Within that window the same key returns the same status code and body, even after a 500 (Stripe API documentation)
Dozens to hundreds: steps an agent may take to complete a single user task (Anthropic, Introducing computer use)

Why it matters

Agents do retry — and without idempotency, they create real money loss. This is the cleanest "ship this now" Frontier signal in the framework. The cost of supporting Idempotency-Key headers is a small server-side change; the cost of not supporting them is duplicate orders, double charges, double bookings, and silent data corruption. Stripe's pattern has been the de facto standard for nearly a decade — there's no engineering excuse not to adopt it on payment-adjacent endpoints, and the pattern extends cleanly to non-payment POSTs as well. We've seen sites in our calibration corpus where the agent-traffic equivalent of a chargeback is a duplicate-order ticket from the customer — same problem, different surface.

Computer-use agents navigate by visual and DOM cues, and your site's UI stability matters more than it did pre-2025. The current CUA generation (Anthropic Computer Use, OpenAI Operator, Google Project Mariner) is improving fast: top scores on the OSWorld benchmark have more than tripled in 18 months. But CUAs are sensitive to UI churn: a button selector that depends on a build-generated class hash breaks when the site is rebuilt and the hash changes; a focus-trap modal locks the agent in the same way it locks a keyboard user. Sites built for single-user-one-tab break under multi-agent or retry conditions; sites built with stable selectors and clean state machines work.

The action layer is increasingly non-deterministic, and your site needs to expect that. Multiple agents might act on the same flow concurrently. A single agent might retry the same POST three times. The user might be running a Computer Use agent and a Claude tool-call simultaneously. None of these are edge cases — they're the new default. Sites built for "one user, one tab, one click" silently break here.

Idempotency is a reusable engineering primitive. The work you do to support agent retries — idempotency keys, structured error responses, machine-parseable status — also protects you against your own client retry logic, network blips, and human users hitting the back button after a slow checkout. The investment is durable.

CUA targetability is structurally about stable, semantic HTML, which is also good for accessibility, also good for SEO, also good for your own end-to-end test suite. Cumulative work, low downside. The list of things you'd do only for CUA-friendliness is short: maybe a few data-testid attributes, maybe a couple of ARIA landmark roles you'd skipped. The rest is hygiene you'd want anyway. A newsletter form with hashed, build-generated class names (sc-bdVaJa fXrxNw, the kind styled-components emits) and a button labelled → is CUA-hostile in the same way it's screen-reader-hostile; the same form with aria-label="Newsletter signup", a real <label> for the input, and a button that says Subscribe works for both. CUA targetability is a forcing function for hygiene that helps everyone.
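The newsletter-form comparison can be turned into a rough automated check. The sketch below is an illustration only, not part of the AAIO framework: the class name TargetabilityAudit, the audit helper, and the two rules it applies (every button needs a readable name, every visible input needs a label or aria-label) are assumptions made for the example, and it ignores cases such as inputs wrapped inside a <label>.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Set


class TargetabilityAudit(HTMLParser):
    """Hypothetical checker: collects buttons and inputs that lack a readable name."""

    def __init__(self) -> None:
        super().__init__()
        self.label_targets: Set[str] = set()          # ids referenced by <label for=...>
        self.unlabelled_inputs: List[Optional[str]] = []  # ids of inputs without aria-label
        self.problems: List[str] = []
        self._button_attrs: Optional[Dict[str, Optional[str]]] = None
        self._button_text = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "label" and attr.get("for"):
            self.label_targets.add(attr["for"])
        elif tag == "input" and attr.get("type") not in ("hidden", "submit", "button"):
            if not attr.get("aria-label"):
                self.unlabelled_inputs.append(attr.get("id"))
        elif tag == "button":
            self._button_attrs = attr
            self._button_text = ""

    def handle_data(self, data):
        if self._button_attrs is not None:
            self._button_text += data

    def handle_endtag(self, tag):
        if tag == "button" and self._button_attrs is not None:
            name = self._button_attrs.get("aria-label") or self._button_text
            # A name made only of symbols (for example an arrow) is not readable.
            if not any(ch.isalnum() for ch in name):
                self.problems.append("button has no readable name")
            self._button_attrs = None


def audit(html: str) -> List[str]:
    """Return a list of problems; an empty list means both rules passed."""
    parser = TargetabilityAudit()
    parser.feed(html)
    parser.close()
    problems = list(parser.problems)
    for input_id in parser.unlabelled_inputs:
        if input_id is None or input_id not in parser.label_targets:
            problems.append("input has no <label for=...> or aria-label")
    return problems


hostile = ('<form><input type="email" class="sc-bdVaJa fXrxNw">'
           '<button class="sc-bdVaJa">→</button></form>')
friendly = ('<form aria-label="Newsletter signup"><label for="email">Email</label>'
            '<input id="email" type="email"><button>Subscribe</button></form>')

print(audit(hostile))
print(audit(friendly))
```

The hostile form fails both rules and the friendly form passes, which mirrors the point above: the fixes that satisfy this check are the same ones a screen reader needs.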

Sub-topics

Frontier watchers (tracked, not yet scored)

F-RTRY Retry Safety — Does your service support Idempotency-Key headers (or equivalent) on POST endpoints? Are error responses structured with HTTP status codes + machine-parseable error codes? Does application state recover gracefully from a partial failure?
F-CUA CUA Control Targetability — Are your interactive controls reachable by stable, semantic selectors? Are modals accessible (no focus traps, no overlay-only triggers)? Do form fields have predictable labels that survive a page rebuild?

Where it's heading

Idempotency promotes from Frontier to scored. When agent traffic share crosses a threshold (likely 5-10% of POST volume on a representative sample), retry-safety stops being a Frontier "watch" and becomes a scored Usability parameter with a power-cap. Sites that haven't shipped idempotency support by then take an overall-grade hit. Plan accordingly. F-RTRY is the most likely Frontier watcher in the framework to promote to a scored Usability parameter.

CUA evals are improving fast, and accessibility tooling and CUA tooling are converging. The OSWorld curve is moving up sharply; WebArena, agent-arena, and similar benchmarks are tracking the same trend. As CUAs get better, their reliability ceiling becomes the site's UI stability rather than the agent's model quality. Sites that are good for screen readers will be good for CUAs; sites that fail accessibility will increasingly fail CUA targetability too. The two evaluation programs are converging.

Action-layer protocols emerging. Today, "is this action a duplicate?" is solved per-endpoint via idempotency keys. The next layer — multi-agent coordination (when two agents act on the same flow), action confirmation (the agent asks the user to confirm a high-stakes action), partial-failure recovery (the agent rolls back on its own when something goes wrong) — is in active development across the same vendor coalition that ships MCP and A2A.

The "self-healing UI" pattern. Sites that detect when a retry is in flight and surface clearer state ("we're processing your previous request, please wait...") are emerging as a high-leverage UX pattern. Stripe Checkout does this already; the rest of e-commerce will follow.
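One way to detect that a retry is in flight is to track which idempotency keys are currently executing. The sketch below is a minimal, single-process illustration under stated assumptions: the InFlightTracker class, the 409 status, and the request_in_progress error code are all hypothetical choices for the example, not a description of how Stripe Checkout or any other product implements this.

```python
import threading
from typing import Callable, Dict, Set, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)

IN_PROGRESS: Response = (
    409,
    '{"error": {"code": "request_in_progress", '
    '"message": "We are processing your previous request, please wait."}}',
)


class InFlightTracker:
    """Hypothetical tracker: one execution per key, clear signal for overlapping retries."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Set[str] = set()
        self._done: Dict[str, Response] = {}

    def run(self, key: str, action: Callable[[], Response]) -> Response:
        with self._lock:
            if key in self._done:
                return self._done[key]   # finished earlier: replay the result
            if key in self._in_flight:
                return IN_PROGRESS       # a previous attempt is still running
            self._in_flight.add(key)
        try:
            response = action()          # runs outside the lock
        except BaseException:
            with self._lock:
                self._in_flight.discard(key)  # failed before producing a response
            raise
        with self._lock:
            self._in_flight.discard(key)
            self._done[key] = response
        return response
```

A UI or agent that receives the in-progress response can show a waiting state and poll, rather than guessing whether to submit again.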

Common mistakes

No idempotency on POST endpoints. The single most common failure mode. Customer-facing duplicate-order tickets are the visible symptom.
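Selectors that depend on auto-generated class names. CSS Modules, styled-components, and similar CSS-in-JS tools generate hashed class names that can change whenever styles or build configuration change. Tailwind utility classes are stable across builds, but they describe styling rather than purpose, so they make weak targets too. Use semantic elements, accessible names, data-testid, or stable semantic class names for anything an agent might target.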
Modal-heavy UI without semantic role markup. Agents and screen readers both lose track of focus inside modals that don't declare role='dialog' and manage focus correctly.
Generic error messages. 'Something went wrong' gives the agent no signal. Use HTTP status codes correctly and include a machine-parseable error code in the body.
Retry logic that creates duplicate orders. Your own client-side retry, your CDN's retry, and the agent's retry all stack. Without idempotency, each layer is another chance to duplicate the same write.

Frequently asked

What is idempotency and why do agents need it?

Idempotency is the property that performing an operation more than once produces the same result as performing it once. Agents do retry — network errors, ambiguous responses, partial failures all trigger retries inside an agent's loop. Without idempotency support, a retry creates duplicate orders, double charges, double bookings, or corrupted state. Stripe's Idempotency-Key header pattern is the de facto standard: the client generates a unique key per intent, the server returns the same response for the same key, and retries become safe by construction.
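On the client side, "one key per intent" means the key is generated once and reused on every retry of that intent. A minimal sketch, assuming a hypothetical transport function send(url, headers, body) that raises ConnectionError when the outcome of the request is unknown; the function name post_once_per_intent and the retry limit are illustrative choices, not a standard API:

```python
import uuid
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)
# Hypothetical transport signature: (url, headers, body) -> Response.
Transport = Callable[[str, Dict[str, str], str], Response]


def post_once_per_intent(send: Transport, url: str, body: str,
                         max_attempts: int = 3) -> Response:
    """Retry a POST on lost connections, reusing one Idempotency-Key throughout."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # generated once per intent
    last_error: Optional[ConnectionError] = None
    for _ in range(max_attempts):
        try:
            return send(url, headers, body)  # same key on every attempt
        except ConnectionError as exc:
            last_error = exc                 # outcome unknown: safe to retry with the same key
    assert last_error is not None
    raise last_error
```

The sketch retries only when the response was lost. If the server replays stored responses for a key, retrying a received 5xx with the same key would simply return the same 5xx.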

How do I support `Idempotency-Key` headers correctly?

Three rules: (1) Accept the header on all POST endpoints (GET, PUT, and DELETE are idempotent by definition in HTTP, so they don't need it). (2) Persist the key alongside the response status code and body for at least 24 hours, the minimum retention window Stripe documents. (3) Return the same response for the same key, including 5xx errors, so a client retry doesn't double-execute the action. The implementation is small but the failure mode without it is ugly: silent duplicate writes that show up days later as customer complaints.
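The three rules can be sketched in a few dozen lines. This is a minimal, in-memory illustration under stated assumptions, not a production implementation: IdempotencyStore, SavedResult, and handle_post are hypothetical names, a real service would use a durable store shared across instances, and the sketch does not handle two requests with the same key arriving concurrently.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]          # (HTTP status code, response body)
RETENTION_SECONDS = 24 * 60 * 60    # rule 2: keep results for at least 24 hours


@dataclass
class SavedResult:
    status: int
    body: str
    saved_at: float


class IdempotencyStore:
    """In-memory stand-in for a durable store (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._results: Dict[str, SavedResult] = {}

    def get(self, key: str) -> Optional[SavedResult]:
        saved = self._results.get(key)
        if saved is None:
            return None
        if self._clock() - saved.saved_at > RETENTION_SECONDS:
            del self._results[key]   # expired: the key is treated as new again
            return None
        return saved

    def put(self, key: str, response: Response) -> None:
        status, body = response
        self._results[key] = SavedResult(status, body, self._clock())


def handle_post(store: IdempotencyStore, key: Optional[str],
                action: Callable[[], Response]) -> Response:
    """Run `action` at most once per key within the retention window."""
    if key is None:
        return action()              # no header sent: plain, non-idempotent POST
    saved = store.get(key)
    if saved is not None:
        return saved.status, saved.body   # rule 3: replay, even if it was a 5xx
    response = action()
    store.put(key, response)
    return response


# Usage: the retry replays the first response instead of creating a second order.
orders = []

def create_order() -> Response:
    orders.append("order")
    return 201, '{"order_id": %d}' % len(orders)

store = IdempotencyStore()
first = handle_post(store, "key-123", create_order)
retry = handle_post(store, "key-123", create_order)
print(first == retry, len(orders))  # prints: True 1
```

The clock is injectable so the retention window can be tested without waiting 24 hours.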

What's a computer-use agent? How is it different from an API agent?

A computer-use agent (CUA) interacts with software via screenshots, mouse clicks, keyboard input, and DOM navigation — like a human would. The current generation includes Anthropic's Computer Use, OpenAI's Operator, and Google's Project Mariner. An API agent calls structured endpoints; a CUA navigates UIs. CUAs are the bridge for sites that don't (or can't) expose APIs: they let agents complete actions on legacy software, dashboards, booking systems, and consumer products that lack programmatic interfaces.

Should I add `data-testid` or other agent-specific markers?

Mostly no — the right answer is good semantic HTML and ARIA, which works for both agents and accessibility users. data-testid is fine where you already use it for your own QA tests, and it doesn't hurt agents to encounter, but adding it specifically for agents is usually a lower-leverage move than fixing the underlying selectors. Stable class names, predictable form labels, no random hash-suffixed CSS classes — those compound across CUA reliability, accessibility, and your own end-to-end test suite.

Do agents handle partial failures gracefully?

It depends entirely on what the site tells them. An agent retrying a payment that returned a 502 needs to know whether the payment succeeded. A generic 'Something went wrong' error gives no signal, and the agent has to choose between abandoning the user's purchase and risking a duplicate charge. Idempotency at the server side and structured error responses (HTTP status code + machine-parseable error code + human-readable message) at the response side are what make graceful partial-failure handling possible.
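A structured error response might look like the sketch below. The JSON shape, the field names code, message, and retryable, and the helper names error_response and should_retry are all assumptions for illustration; no standard mandates this exact format.

```python
import json
from typing import Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def error_response(status: int, code: str, message: str, retryable: bool) -> Response:
    """Server side: status code + machine-parseable code + human-readable message."""
    body = {"error": {"code": code, "message": message, "retryable": retryable}}
    return status, json.dumps(body)


def should_retry(raw_body: str) -> bool:
    """Agent side: retry only when the body explicitly says it is safe."""
    try:
        return bool(json.loads(raw_body)["error"]["retryable"])
    except (ValueError, KeyError, TypeError):
        # Unstructured body such as 'Something went wrong': no signal.
        # Declining to retry is a conservative choice made for this sketch.
        return False


status, body = error_response(
    502, "upstream_timeout",
    "The payment processor did not respond. No charge was made.",
    retryable=True,
)
print(status, should_retry(body), should_retry("Something went wrong"))
```

With the structured body the agent can act on a clear signal; with the generic string it is left with the abandon-or-risk-a-duplicate choice described above.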

How do I test my site against computer-use agents?

Three practical steps: (1) Run Anthropic's Computer Use against your top-five user flows and watch for stuck states. (2) Repeat the same flows with OpenAI's Operator. (3) For each failure, identify whether the cause is structural (focus-trap modal, JS-only button, random class names) or content (ambiguous error, hidden state). Most failures are structural, which means they're also accessibility failures and worth fixing on the human side.

Is CUA targetability the same as accessibility?

Substantially overlapping but not identical. WCAG Level AA conformance gets you most of the way to CUA targetability: semantic HTML, ARIA roles, keyboard navigability, focus management, predictable form labels. The CUA-specific delta is mostly about visual stability: random class names, layout shifts, and animation-heavy UIs hurt CUAs more than they hurt screen readers. The good news: investing in accessibility is the highest-ROI CUA-readiness work you can do.