Frontier spoke
Computer-Use Agents — is your UI targetable by Anthropic Computer Use, OpenAI Operator, and Project Mariner?
Computer-use agents — Anthropic Computer Use, OpenAI Operator, Google Project Mariner — navigate your site the way a human would: screenshots, mouse clicks, keyboard input, DOM traversal. They're the agent layer for sites without APIs. CUA targetability is whether your UI gives them stable selectors, semantic role markup, predictable focus management, and machine-parseable error states. It is substantially the same work as accessibility — and the same investment pays both audiences.
By Chris Mühlnickel · 2026-05-16
What is Computer-Use Agent Targetability?
Computer-Use Agent (CUA) Targetability is whether your interactive UI exposes stable, semantic selectors; manages focus predictably across modals and route transitions; emits machine-parseable error states; and avoids the bot-fingerprinting and selector-instability patterns that break legitimate CUAs.
By the numbers
- 72.5% — Claude Sonnet 4.6 on OSWorld-Verified — up from 14.9% (Sonnet 3.5) sixteen months earlier (Anthropic)
- 58.1% — OpenAI Operator's Computer-Using Agent on WebArena — state-of-the-art browser-task score at launch (OpenAI)
- 10× — browser tasks Project Mariner can now run in parallel from cloud VMs (TechCrunch (covering Google I/O 2025))
Why it matters
Computer-use agents are the bridge for everything without an API — and that's most of the web. The same site that ships a beautiful structured-data layer for retrieval agents may have no machine-callable surface at all for the agent that needs to do something on the user's behalf. Booking a flight, filing a claim, updating an account, adding a guest to a calendar event: these flows live in UI, and the agent layer that reaches them is the CUA layer. A site that's hostile to computer-use agents silently disqualifies itself from the action surface, even when its retrieval surface is excellent.
The CUA generation is improving fast, and the reliability ceiling is shifting to the site. The OSWorld scoreboard rose nearly fivefold in sixteen months: Anthropic's Claude Sonnet 3.5 scored 14.9% in October 2024; Sonnet 4.6 scores 72.5% on OSWorld-Verified today. OpenAI's Operator launched at 58.1% on WebArena. Google's Project Mariner now runs up to 10 browser tasks in parallel from cloud VMs. As the agents close the gap on human-level reliability, the bottleneck moves from "is the model smart enough?" to "is the site's UI stable enough?", which means site-side investment compounds faster than it used to.
Random-hash class names break CUAs the same way they break QA test suites. CSS Modules, styled-components, Emotion: build-time and CSS-in-JS styling toolchains generate class names like sc-bdVaJa fXrxNw that can change from one build to the next. A CUA that learned "click the button with class fXrxNw" yesterday can't find it today; a QA test suite hits the same wall. The fix is the same in both cases: stable data-testid attributes, real ARIA labels, semantic element types. The investment compounds across CUA reliability, screen-reader compatibility, and your own end-to-end test infrastructure.
Modals that mismanage focus lock out CUAs the same way they lock out keyboard users. A modal that renders an overlay but never declares role="dialog", never sets aria-modal="true", never traps tab focus to its interior, and never restores focus to the trigger on close is CUA-hostile, screen-reader-hostile, and brittle to test. The pattern is so common that it's the second-most-cited CUA failure mode in Anthropic's developer reports.
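The four modal requirements named above can be expressed as a small check plus the wrap-around rule that a focus trap applies on Tab and Shift+Tab. This is a minimal sketch under stated assumptions: the DialogState shape and both function names are hypothetical, and a real audit would read these values from the live DOM rather than from a plain object.

```typescript
// Hypothetical description of an open modal; not a real library type.
interface DialogState {
  role?: string;                 // value of the role attribute, if any
  ariaModal?: string;            // value of aria-modal, if any
  restoresFocusOnClose: boolean; // does focus return to the trigger on close?
}

// Lists which of the modal requirements are missing.
function dialogProblems(d: DialogState): string[] {
  const problems: string[] = [];
  if (d.role !== "dialog" && d.role !== "alertdialog") {
    problems.push('missing role="dialog"');
  }
  if (d.ariaModal !== "true") {
    problems.push('missing aria-modal="true"');
  }
  if (!d.restoresFocusOnClose) {
    problems.push("focus is not restored to the trigger on close");
  }
  return problems;
}

// Focus trap rule: Tab on the last element wraps to the first, and
// Shift+Tab on the first wraps to the last. `current` is assumed to be
// an index in the range 0..count-1.
function nextFocusIndex(current: number, count: number, shiftKey: boolean): number {
  if (count <= 0) {
    throw new Error("a dialog needs at least one focusable element");
  }
  const step = shiftKey ? -1 : 1;
  return (current + step + count) % count;
}
```

With three focusable elements, nextFocusIndex(2, 3, false) wraps forward to index 0 and nextFocusIndex(0, 3, true) wraps backward to index 2, so focus never leaves the dialog while it is open.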
Investing in accessibility is the highest-ROI CUA-readiness work. The list of things you'd do only for CUA targetability is short: a few data-testid attributes, maybe a couple of ARIA landmark roles you'd skipped, perhaps a longer error message with a machine-parseable code. Everything else — semantic HTML, ARIA roles, keyboard navigability, focus management, predictable form labels, stable selectors — is hygiene that helps every audience. CUAs aren't asking you to build a new layer; they're surfacing the cost of the layer you skipped.
Sites built for "one user, one tab, one click" silently break under retry, multi-agent, and parallel scenarios. A back button that re-submits the previous POST creates a duplicate when a CUA navigates back to recover from a misread state. A flow that depends on session-stored intermediate data breaks when two agents land on it concurrently. A form that throws away input on a soft validation error costs the agent a full re-walk of the funnel. The single-actor assumption is the new fragility; CUA testing surfaces it cheaply, before it becomes a production incident.
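One common way to make a back-button re-submit or an agent retry harmless is an idempotency key: the form carries a unique key, and the server replays the stored result when it sees the same key again instead of running the action twice. The sketch below assumes an in-memory store and a hypothetical IdempotentHandler class; a real deployment would need shared storage with expiry, and nothing here describes a specific framework's API.

```typescript
// Minimal in-memory idempotency sketch (hypothetical class, illustration only).
class IdempotentHandler<T> {
  private readonly results = new Map<string, T>();
  private readonly action: (payload: string) => T;

  constructor(action: (payload: string) => T) {
    this.action = action;
  }

  // Runs the side-effecting action at most once per key. A repeated key
  // returns the stored result and reports that it was a replay.
  submit(key: string, payload: string): { result: T; replayed: boolean } {
    if (this.results.has(key)) {
      return { result: this.results.get(key) as T, replayed: true };
    }
    const result = this.action(payload);
    this.results.set(key, result);
    return { result, replayed: false };
  }
}
```

If an agent navigates back and re-submits the same form, the second submit call with the same key returns the first booking rather than creating a duplicate, and the replayed flag lets the UI show a clear "already processed" state.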
Where it's heading
OSWorld and WebArena scores keep climbing, with diminishing returns approaching. The current trajectory (14.9% → 72.5% in sixteen months on OSWorld-Verified) cannot continue indefinitely; we should expect the curve to bend over the next year as the easy reliability gains get harvested. What replaces "is the model smart enough?" as the gating factor is the site's UI stability, which means site-side investment in semantic HTML, stable selectors, and accessibility tooling becomes the lever that determines how often a CUA actually succeeds.
Accessibility tooling and CUA tooling converging into one evaluation program. axe-core, Lighthouse, WAVE, and the rest of the a11y tooling ecosystem already test most of what a CUA needs: semantic HTML, ARIA roles, focus management, contrast, keyboard navigability. Expect explicit CUA-readiness rules to land in these tools over the next 12-18 months — likely as new axe-core rule packs, Lighthouse audits, or first-class CUA scorecards. The two evaluation programs converge into one site-side report.
"Self-healing UI" patterns: the [Retry Safety](/learn/agent-intelligence/retry-safety) parallel. Stripe Checkout already detects when a previous request is in flight and shows "we're processing your previous payment, please wait..." rather than letting the user resubmit. The pattern generalizes to CUAs: detect that an automated session is in the funnel, surface clearer state messages, gate destructive actions behind an explicit confirmation step. Expect e-commerce, banking, and high-stakes-transaction sites to adopt the pattern over the next 18 months.
Action confirmation and multi-agent coordination protocols emerging. Today the agent decides on its own when an action is high-stakes enough to ask the user. The next layer (standardized "this action requires human confirmation" markers in the page, multi-agent coordination when two agents land on the same flow) is in active development. The browser layer is the likely host (WebMCP, CUA-aware accessibility APIs), and Salesforce's Agentforce-style platforms are pushing the same primitive from the agent side.
Common mistakes
- Random-hash class names with no stable selector fallback. CSS Modules and styled-components generate class hashes that can change between builds. Add `data-testid` attributes, real ARIA labels, or stable semantic class names for anything an agent (or your own test suite) might target.
- Modals that don't declare `role="dialog"` and don't manage focus on open/close. The pattern locks out CUAs the same way it locks out keyboard users. Declare the role, set `aria-modal="true"`, trap tab focus to the modal interior, and restore focus to the trigger element on close.
- Generic error messages with no machine-parseable code. "Something went wrong" tells the agent nothing about whether to retry, abandon, or escalate. Pair human-readable error text with a stable machine-parseable error code (`payment.card_declined`, `inventory.out_of_stock`) and the correct HTTP status, the same pattern as Retry Safety.
- Modal-heavy UI without `role` markup or semantic element types. Custom `<div onClick>` controls without `role="button"`, fake dropdowns built from styled `<span>` elements, and ARIA-free disclosure widgets all force the agent into pixel-coordinate guessing instead of semantic reasoning.
- Retry-creating UI patterns. A back button that re-submits the previous POST, a refresh that re-fires a non-idempotent action, a soft validation error that throws away user input: all of these are CUA-hostile because the agent's recovery loop naturally exercises them. Make idempotent operations the default; gate side-effecting actions behind explicit confirmation.
- Bot-fingerprinting rules at the CDN that catch legitimate CUAs. Cloudflare Bot Fight Mode, AWS WAF Bot Control, and Akamai Bot Manager flag automated browsers, including legitimate user-driven CUAs running in cloud VMs. See Bot Blocking Detection. Allowlist the major CUA user-agents and IP ranges, or route them to a CAPTCHA that the agent can hand off to the user.
- Animation-heavy or layout-shift-prone UI that breaks screenshot-based reasoning. CUAs reason from screenshots; a UI that shifts elements mid-render, animates continuously, or hides controls behind hover-only triggers degrades the agent's ability to plan a click. Stable layouts, no surprise reflows, no hover-only interaction — all standard accessibility hygiene that pays the CUA audience too.
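The error-message item in the list above says a stable code tells the agent whether to retry, abandon, or escalate. A minimal sketch of that decision follows. The UiError shape, the nextStep function, and the specific mapping from codes and statuses to actions are assumptions made for illustration; each site would define its own codes and policy.

```typescript
type NextStep = "retry" | "abandon" | "escalate";

// Hypothetical error payload pairing human text with a machine-parseable code.
interface UiError {
  code: string;       // stable identifier, e.g. "payment.card_declined"
  message: string;    // human-readable text shown in the UI
  httpStatus: number; // status of the response that produced the error
}

// Illustrative policy only: transient server-side failures are retried,
// payment problems go back to the user, everything else ends the attempt.
function nextStep(err: UiError): NextStep {
  if (err.httpStatus === 429 || err.httpStatus >= 500) {
    return "retry";
  }
  if (err.code.startsWith("payment.")) {
    return "escalate";
  }
  return "abandon";
}
```

A bare "Something went wrong" gives this function nothing to branch on, which is the practical cost of an error with no code and no meaningful status.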
Frequently asked
What is a computer-use agent?
A computer-use agent (CUA) is an AI agent that interacts with software through screenshots, mouse clicks, keyboard input, and DOM navigation, the same surface a human user touches. The current generation is led by Anthropic's Computer Use (announced October 2024), OpenAI's Operator (January 2025), and Google's Project Mariner (announced December 2024, expanded at Google I/O 2025). CUAs are the bridge for sites that don't expose APIs: they let agents drive legacy software, dashboards, booking systems, and consumer products that never planned for programmatic interaction.
How is CUA targetability different from API agent readiness?
An API agent calls structured endpoints; a CUA navigates UIs. The skills don't transfer cleanly: a site with a beautiful OpenAPI spec can still be CUA-hostile if its UI uses random-hash class names, modals that mismanage focus, and JS-only buttons. Conversely, a site with no API at all can be highly CUA-friendly if it has clean semantic HTML and stable selectors. CUA targetability is its own evaluation, and for many sites it's the only agent surface they have.
Is CUA targetability the same as accessibility?
Substantially overlapping but not identical. WCAG-AA compliance gets you most of the way: semantic HTML, ARIA roles, keyboard navigability, focus management, predictable form labels. The CUA-specific delta is mostly selector and visual stability: random-hash class names from CSS Modules or styled-components break CUAs more than they break screen readers, and aggressive layout shifts confuse the agent's screenshot-based reasoning loop. The good news: investing in accessibility is the highest-ROI CUA-readiness work you can do.
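Layout stability can be checked roughly by recording element positions twice, a short interval apart, and flagging anything that moved. The sketch below assumes the two snapshots were captured elsewhere (for example from bounding boxes in a browser test) and keyed by a stable identifier; the Box and Snapshot types, the shiftedElements name, and the 2-pixel default tolerance are all hypothetical choices.

```typescript
interface Box { x: number; y: number; width: number; height: number }

// Element positions keyed by a stable identifier such as a data-testid value.
type Snapshot = Record<string, Box>;

// Returns the ids of elements that moved, resized, or disappeared between
// two snapshots of the same page state.
function shiftedElements(before: Snapshot, after: Snapshot, tolerancePx = 2): string[] {
  const moved: string[] = [];
  for (const id of Object.keys(before)) {
    const a = before[id];
    const b = after[id];
    if (!a) continue;
    if (!b) {
      moved.push(id); // element vanished between snapshots
      continue;
    }
    const drift = Math.max(
      Math.abs(a.x - b.x),
      Math.abs(a.y - b.y),
      Math.abs(a.width - b.width),
      Math.abs(a.height - b.height),
    );
    if (drift > tolerancePx) moved.push(id);
  }
  return moved;
}
```

An element that appears in this list is one where a click planned from the first screenshot may land on the wrong target by the time it executes.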
Why do CUAs break on auto-generated class names?
CUAs target elements with a mix of visual cues (screenshot → click coordinates) and DOM cues (CSS selectors, ARIA labels, element text). When your build system generates a button class like sc-bdVaJa fXrxNw and rebuilds the hash every deploy, the agent's selector from yesterday no longer matches today's DOM. This is the same fragility QA test suites hit, which is why the solution is the same: stable data-testid attributes, real ARIA labels, semantic element types (<button>, not <div onClick>).
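The preference order described above (semantic role and label first, then a test id, and generated classes never) can be pictured as a small ranking function. This is a minimal sketch, not a description of how any particular agent works: the ElementInfo shape, the looksGenerated heuristic, and the selector string formats are assumptions made for illustration.

```typescript
// Hypothetical description of one interactive element; not a real library type.
interface ElementInfo {
  tag: string;             // e.g. "button"
  role?: string;           // explicit or implicit ARIA role
  accessibleName?: string; // aria-label or visible text
  testId?: string;         // value of data-testid, if any
  classes: string[];       // raw class list from the DOM
}

// Heuristic only: flags "sc-" / "css-" prefixed names and short alphanumeric
// tokens with at least two capitals (such as "fXrxNw") as build-generated.
function looksGenerated(className: string): boolean {
  if (/^(sc|css)-[A-Za-z0-9]+$/.test(className)) return true;
  const capitals = (className.match(/[A-Z]/g) ?? []).length;
  return /^[A-Za-z0-9]{5,8}$/.test(className) && capitals >= 2;
}

// Picks the most durable way to target the element, or null when only
// generated class names are available.
function pickSelector(el: ElementInfo): string | null {
  if (el.role && el.accessibleName) {
    return `role=${el.role}[name="${el.accessibleName}"]`;
  }
  if (el.testId) {
    return `[data-testid="${el.testId}"]`;
  }
  const stable = el.classes.filter((c) => !looksGenerated(c));
  if (stable.length > 0) {
    return `${el.tag}.${stable[0]}`;
  }
  return null;
}
```

A button whose only classes are sc-bdVaJa and fXrxNw yields null here, which is the situation that pushes an agent back to pixel coordinates.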
How do I test my site against computer-use agents?
Three practical steps: (1) run Anthropic's Computer Use API against your top-five user flows and watch where the agent gets stuck. (2) Try the same flows with OpenAI's Operator. (3) For each failure, identify whether the cause is structural (mismanaged modal focus, JS-only button, hashed class names) or content (ambiguous error, hidden state). Most failures are structural, which means they're also accessibility failures, and worth fixing for the human side regardless.
Do I need to add `data-testid` attributes for agents?
Mostly no — good semantic HTML and ARIA labels work for both agents and accessibility users without an agent-specific marker. data-testid is fine where you already use it for QA tests, and it doesn't hurt CUAs to encounter, but adding it specifically for agents is usually a lower-leverage move than fixing the underlying selectors. Stable class names, predictable form labels, real <button> elements — those compound across CUA reliability, accessibility, and your end-to-end test suite.
Will Spekto score CUA targetability in a future framework version?
It's tracked as a Frontier watcher today. Promotion to a scored parameter depends on two things: CUA traffic share crossing a measurable threshold on real sites, and a stable detection methodology that doesn't drift with each new model release. Anthropic, OpenAI, and Google all gate CUA usage behind explicit user consent today, which suppresses traffic share — that constraint relaxes as the agents prove reliable enough for default-on use.