Computer Use · 2026

Updated April 23, 2026

Agents that can actually drive your computer.

Twelve months ago no AI could operate a desktop better than a drunk intern — OSWorld scores sat under 15%. Today ChatGPT Agent leads OSWorld-Verified at 87.0%, clearing the 72.4% human-expert baseline. Two models now beat humans on OSWorld, three on WebVoyager, and a dozen on SWE-Bench Verified. The big story isn't one breakthrough — it's everyone shipping at once, and the benchmarks themselves getting rewritten mid-flight.

This page tracks 55 agents across 4 architectural categories, scored on 19 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.

OSWorld-V SOTA: 87.0% (vs 72.4% human)

The 4 types of computer use

Every Computer Use agent, by type

55 tracked agents across 4 types. Ranked within each category by peak benchmark.

Screen-level OS control

15 agents

Agent	Maker	Launch	OSWorld	WebVoyager	SWE-Bench	GAIA	Pricing
Claude Sonnet 4.6: Claude Sonnet 4.6 is a multimodal large language model developed by Anthropic and released on February 25, 2026. Its documented performance	Anthropic	2026-02	72.1	—	—	—	API: $3/$15 per M tokens
Kimi K2.5 (OSS): Kimi K2.5 is an open-source, multimodal AI model from Moonshot AI, featuring 1 trillion parameters, vision capabilities, and Agent Swarm tec	Moonshot AI	2026-01	63.3	—	—	—	API pay-as-you-go
Claude Computer Use: The first public screen-level OS control API. Powered by Claude 3.5 → 4.x → Opus 4.7. Takes screenshots, identifies UI elements, emits raw c	Anthropic	2024-10	72.1	—	—	—	Claude API: input $5/M, output $25/M
Kimi K2.6 (OSS): Moonshot AI's 1T-param MoE (32B active) built for long-horizon agentic coding (up to 13h continuous) with agent swarm scaling to 300 sub-age	Moonshot AI	2026-04	73.1	—	—	—	API: 0.60/2.75 per M tokens
Holo3-35B-A3B: H Company's specialized OSWorld agent, currently #1 on OSWorld-Verified at 80.4% (2 runs, 100 steps). First model to explicitly beat the 72.	H Company	2026-04	80.4	—	—	—	H Company enterprise
Claude Sonnet 4.5: Claude Sonnet 4.5 (Sept 2025 release) on OSWorld-Verified at 62.9%. Benchmark milestone showing rapid improvement from prior Claude generati	Anthropic	2025-09	62.9	—	—	—	Legacy Anthropic API
Seed-1.8: ByteDance Seed's 1.8 general model on OSWorld-Verified at 61.9%. ByteDance has invested heavily in computer-use via the UI-TARS + Seed linea	ByteDance Seed	2025-12	61.9	—	—	—	Doubao ecosystem
EvoCUA-20260105: Meituan's LongCat Team specialized CUA research model. OSWorld-Verified 56.7% at max 50 steps, strong for a step-constrained evaluation.	Meituan LongCat	2026-01	56.7	—	—	—	Research
GUI-Owl-1.5 32B (OSS): Alibaba Tongyi Lab's Mobile-Agent Team specialized model. OSWorld-Verified 55.4% at max 50 steps. Open weights, Chinese research group.	Alibaba Tongyi Lab	2026-03	55.4	—	—	—	Free (OSS)
DeepMiner-Mano-72B: Mininglamp's DeepMiner-Mano specialized 72B GUI agent. OSWorld-Verified at 53.9% with max 100 steps.	Mininglamp Technology	2025-10	53.9	—	—	—	Research
UI-TARS-2: ByteDance Seed's second-generation UI-TARS specialized GUI agent. OSWorld-Verified at 53.1%, max 100 steps.	ByteDance Seed	2025-10	53.1	—	—	—	ByteDance ecosystem
OpenCUA-72B (OSS): University of Hong Kong × Moonshot AI joint open-source CUA model. OSWorld-Verified at 45.0% (3-run average, 100 steps). Most-cited open-wei	HKU & Moonshot AI	2025-10	45.0	—	—	—	Free (OSS)
OpenAI Computer Use Preview (CUA): OpenAI's Operator/CUA specialized model as benchmarked on OSWorld-Verified at 31.3% (max 50 steps). Now folded into ChatGPT Agent.	OpenAI	2025-01	31.3	—	—	—	ChatGPT Pro
Claude Computer Use Preview: Anthropic's original computer-use preview from Oct 2024, scored at 31.3% on OSWorld-Verified with max 50 steps. Proved the paradigm; since s	Anthropic	2024-10	31.3	—	—	—	Anthropic API
Microsoft Copilot Studio CUA: Enterprise computer-using agents GA April 2026. Choice of Claude Sonnet 4.5 or OpenAI CUA backends. Built-in credentials vault, Purview audi	Microsoft	2026-04	—	—	—	—	Copilot Studio subscription

Browser-only

19 agents

Agent	Maker	Launch	OSWorld	WebVoyager	SWE-Bench	GAIA	Pricing
Surfer 2: French startup H Company's proprietary enterprise agent. Current WebVoyager SOTA at 97.1%. Top of Steel.dev public leaderboard.	H Company	2026-02	—	97.1	—	—	Enterprise contract
Browser Use (OSS): Open-source Python library (50k+ GitHub stars). Multi-tab, memory, parallel agents. BYO-LLM. 89.1% on WebVoyager by independent eval.	Browser Use (OSS)	2024-10	—	89.1	—	—	Free (OSS)
Magnitude: Independent browser agent focused on test automation. #2 on WebVoyager at 93.9%, behind only Surfer 2.	Magnitude	2025-06	—	93.9	—	—	Contact sales
ChatGPT Agent: OpenAI's unified agent (absorbed Operator January 2026). Runs Chrome via vision-action loop with GPT-5-class reasoning. Requires ChatGPT Pro	OpenAI	2026-01	87.0	—	—	—	ChatGPT Pro $200/mo; Plus $20/mo (waitlisted)
Skyvern (OSS): Open-source browser agent. 85.85% WebVoyager, best-in-class on 'WRITE' form-filling tasks. Workflow chaining. Strong RPA replacement positio	Skyvern (Y Combinator)	2024-04	—	85.8	—	—	Open-source; enterprise contract
Stagehand (OSS): MIT-licensed SDK. V3 rewrite (Feb 2026) uses Chrome DevTools Protocol directly, 44% faster than v2. AI-native caching + self-healing DOM.	Browserbase	2024-06	—	85.8	—	—	Free SDK + Browserbase cloud ($99/mo+)
Project Mariner: Chrome-integrated agent powered by Gemini 2.0 → 3.x. Runs 10 concurrent VM tasks. Available via Google AI Ultra subscription. Strongest on S	Google DeepMind	2024-12	—	83.5	—	—	Google AI Ultra subscription
Playwright MCP (OSS): Microsoft's official MCP server wrapping Playwright. Exposes Chromium/Firefox/WebKit as MCP tools so any MCP client (Claude Code, Cursor, Co	Microsoft	2025-03	—	—	—	—	Free (Apache-2.0)
Multi-On: Once a top-3 browser agent. Notably absent from 2026 leaderboards, overtaken by Surfer 2, Magnitude, and AIME.	Multi-On	2023-11	—	—	—	—	Subscription
Chrome DevTools MCP (OSS): Google Chrome team's official MCP server exposing DevTools Protocol to AI agents. Used by Claude Code + Cursor for browser automation + perf	Google	2025-09	—	—	—	—	Free (Apache-2.0)
mcp-chrome (community) (OSS): Popular community MCP server that connects any MCP-compatible AI agent to a live Chrome browser via extension + local bridge. Lightweight al	hangwin	2025-06	—	—	—	—	Free (MIT)
Claude for Chrome: Anthropic's Chrome extension. Research preview August 2025, expanded to Max users November 2025, all paid users December 2025. Reads and act	Anthropic	2025-08	—	—	—	—	Claude Pro ($20) / Max ($100+)
ChatGPT Atlas: OpenAI's Chromium-based AI-native browser. macOS at launch, rolling to Windows/iOS/Android. Persistent ChatGPT sidebar + Agent Mode for auto	OpenAI	2025-10	—	—	—	—	Free sidebar; Agent Mode via ChatGPT Plus/Pro
Perplexity Comet: AI-native browser from Perplexity. Started as Max-only ($200/mo), went free globally October 2, 2025. Assistant can execute multi-step workf	Perplexity	2025-07	—	—	—	—	Free · Comet Plus $5 · Pro $20 · Max $200
Dia: Browser Company's AI-first browser, successor to Arc (sunset May 2025). Acquired by Atlassian in 2025. Persistent AI command bar + skills wo	The Browser Company	2025-06	—	—	—	—	Free · Dia Pro $20/mo
Fellou: Silicon Valley agentic browser. Fellou CE launched September 2025 as 'world's first spatial agentic browser'. Built-in Research + Shopping +	Fellou Inc	2025-09	—	—	—	—	Freemium + paid
Opera Neon: Opera's new agentic browser (2025 version, not to be confused with 2017 Neon). Invite-only Sep 2025, public Dec 2025. Subscription model. Ru	Opera	2025-12	—	—	—	—	$19.90/mo
Brave Leo + AI Browsing: Brave's built-in AI. Agentic 'AI Browsing' mode shipped December 2025 (Nightly 1.86+). Privacy-first (no account for freemium tier). Chromiu	Brave	2023-11	—	—	—	—	Free · Leo Premium $15/mo
Browserbase: Managed headless-browser infrastructure, the 'AWS for headless browsers'. $40M Series B June 2025 at $300M valuation. Residential IPs + CAPTCHA	Browserbase	2024-06	—	—	—	—	Usage-based, $99/mo+

Sandboxed VM / container

16 agents

Agent	Maker	Launch	OSWorld	WebVoyager	SWE-Bench	GAIA	Pricing
Writer Action Agent: Enterprise agent from Writer (content platform). GAIA Level 3 leader at 61%; surpassed Manus mid-2025.	Writer	2025-09	—	—	—	61.0	Enterprise contract
Manus AI: Chinese multi-agent (Executor + Planner + Knowledge) with 29 tools, per-session isolated Linux sandbox. Desktop launched March 2026 with dir	Butterfly Effect	2025-03	—	—	—	57.7	Subscription (invite-only at launch)
OpenHands (OSS): MIT-licensed open-source agent (formerly OpenDevin). Docker-based sandbox, BYO-LLM. 53%+ on SWE-bench Verified with Claude. Top-3 consistent	All-Hands AI	2024-03	—	—	53.0	—	Free (self-hosted) or All-Hands Cloud
Lovable: AI full-stack app builder (formerly GPT Engineer). Viral 2025. Supabase integration out of the box. Built for non-technical founders.	Lovable	2024-11	—	—	—	—	Free + Pro $25/mo
Replit Agent 4: Cloud development agent with parallel tasks, branching, sub-agent spawning. Effort-based pricing. Mix of Claude/GPT/Gemini backends.	Replit	2026-03	—	—	—	—	Core $25/mo + credits; Pro $100/mo + credits
E2B (OSS): Open-source secure cloud runtime for AI agents. Firecracker microVMs. Python/JS SDKs + custom templates. Used as the sandbox backend by many	E2B	2023-11	—	—	—	—	Free tier + usage-based
Daytona (OSS): Open-source ephemeral dev environments, pivoted to AI-agent runtime in 2025. Self-hostable. Used for per-session Linux sandboxes.	Daytona	2024-05	—	—	—	—	Free (OSS) + cloud paid
Modal Sandboxes: Serverless sandbox runtime for agent code execution. Sub-second cold starts. Popular as execution backend for research agents.	Modal Labs	2024-09	—	—	—	—	$0.00003942/CPU-sec
Bedrock AgentCore: AWS's managed agent platform (GA Oct 2025, preview July 2025). Includes Browser tool, Code Interpreter, Gateway, Memory. Any model via Bedro	Amazon	2025-10	—	—	—	—	AWS usage-based
Gemini Enterprise (Agentspace): Google's enterprise agent platform, rebranded from Agentspace at Cloud Next 2026. Managed agents + ADK (Agent Development Kit, open source).	Google	2024-12	—	—	—	—	Enterprise contract
Salesforce Agentforce 360: Salesforce's enterprise agent platform. v1 Sep 2024; Agentforce 360 for AWS early 2026. Deep CRM integration, guardrails, per-conversation p	Salesforce	2024-09	—	—	—	—	Per-conversation
IBM watsonx Orchestrate: IBM's enterprise agent platform with 150+ pre-built agents in the Agent Catalog. Granite models under the hood. Targets regulated industries	IBM	2025-05	—	—	—	—	Enterprise contract
ServiceNow Now Assist: Enterprise AI agents baked into ServiceNow workflows. IBM Granite integration from May 2024. ITSM + HR + customer service automation.	ServiceNow	2023-09	—	—	—	—	Enterprise contract
Bolt.new: Browser-based AI full-stack builder. Runs npm install + code in WebContainer directly in the browser. Bolt V2 shipped 2025.	StackBlitz	2024-10	—	—	—	—	Credits-based, $20/mo Pro
Firebase Studio: Google's browser-based AI development environment (rebrand of Project IDX). Firebase + Gemini + AI Studio integration. Free tier available.	Google	2025-04	—	—	—	—	Free tier + usage
Emergent: Full-stack agent with live preview sandbox. Built-in hosting + auth + database. Aggressive pricing vs Cursor/Replit/Bolt.	Emergent	2025-01	—	—	—	—	Free + paid tiers

Coding-focused

5 agents

Agent	Maker	Launch	OSWorld	WebVoyager	SWE-Bench	GAIA	Pricing
SWE-Agent (OSS): Open-source research agent (NeurIPS 2024). Mini-SWE-Agent scores >74% on SWE-bench Verified in 100 lines of Python, no tool-calling needed.	Princeton + Stanford	2024-04	—	—	74.0	—	Free (OSS)
Devin: Original 'AI software engineer'. Dropped from $500/mo → $20/mo Core in April 2025. SWE-1.5 in-house model scores 40.08% on SWE-bench Pro. 67	Cognition	2024-03	—	—	—	—	Core $20/mo + $2.25/ACU; Team $500/mo
Aider (OSS): Terminal-first AI pair programmer. Git-integrated. Batch editing. BYO-LLM. Popular in the local-LLM community.	Aider (OSS)	2023-05	—	—	—	—	Free (OSS)
Cursor Agent: IDE-first agent inside Cursor. Adds 16pp over raw model via scaffold. Reports 70% on CursorBench with Opus 4.7. Doesn't publish SWE-bench Ve	Anysphere	2025-06	—	—	—	—	$20/mo Pro
v0: Vercel's generative UI + coding agent. Creates React/Next.js from natural language. v0 Agent connects to existing repos.	Vercel	2023-10	—	—	—	—	Credits, $20/mo Pro

Scores from official publications + independent leaderboards (Steel.dev, OSWorld, SWE-Bench). Dash = not published.

How the benchmarks evolved — 2023 → 2026

The benchmark landscape changed faster than the models did. OSWorld v1 shipped in April 2024; by July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled on three benchmarks, not one.

2023: WebArena + Mind2Web
First generation. DOM-only. Fully gamed by 2025.
Apr 2024: OSWorld (v1)
XLANG Lab's first real-desktop VM benchmark. 369 tasks.
Jun 2024: WebVoyager + GAIA
Web agents + general reasoning. GPT-4V as judge (later criticized).
Dec 2024: TheAgentCompany
CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.
Apr 2025: BrowseComp
OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.
Jul 2025: OSWorld-Verified
XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.
Sep 2025: GDPval
OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.
Oct 2025: SWE-Bench Pro + Terminal-Bench 2.0
Held-out, contamination-resistant successors to the gamed originals.
2026: Core triad consolidates
OSWorld-Verified + BrowseComp + Terminal-Bench 2.0 = weighted agentic score.
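
The timeline names the three benchmarks behind the weighted agentic score but not how they are weighted. A minimal sketch of such a score, assuming equal weights and percentage inputs (the function name, dictionary keys, and weights are all hypothetical):

```python
from typing import Mapping, Optional

# Assumption: the page names the three benchmarks but not their weights,
# so equal thirds are used here purely for illustration.
TRIAD_WEIGHTS = {
    "osworld_verified": 1 / 3,
    "browsecomp": 1 / 3,
    "terminal_bench_2": 1 / 3,
}


def weighted_agentic_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = TRIAD_WEIGHTS,
) -> Optional[float]:
    """Weighted mean of percentage scores; None if any benchmark is unpublished."""
    if any(scores.get(name) is None for name in weights):
        return None  # mirrors a dash ("not published") on the leaderboard
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Illustrative numbers only, not taken from any real agent.
example = weighted_agentic_score(
    {"osworld_verified": 60.0, "browsecomp": 30.0, "terminal_bench_2": 90.0}
)
```

Returning None for an incomplete triad keeps agents with unpublished scores out of the ranking instead of silently treating a missing score as zero.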

Every benchmark that matters — what each one actually measures

A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.

How they actually work

The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
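
A rough sketch of how the five modules might be wired together. This is not Agent S2's code; every class, field, and stub below is hypothetical, and the lambdas in `demo` stand in for real model calls and real input devices:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass
class ModularAgent:
    planner: Callable[[str, List[str]], str]   # (goal, memory) -> next step, "" when done
    grounder: Callable[[str, str], Point]      # (step, screen) -> click coordinates
    executor: Callable[[Point], str]           # performs the click, returns the new screen
    verifier: Callable[[str, str], bool]       # (step, new screen) -> did the step work?
    memory: List[str] = field(default_factory=list)

    def run(self, goal: str, screen: str, max_steps: int = 10) -> List[str]:
        for _ in range(max_steps):
            step = self.planner(goal, self.memory)
            if not step:
                break  # planner signals completion with an empty step
            screen = self.executor(self.grounder(step, screen))
            outcome = "ok" if self.verifier(step, screen) else "failed"
            self.memory.append(f"{step}: {outcome}")
        return self.memory


def demo() -> List[str]:
    """Toy run with stubbed modules and a fixed two-step plan."""
    plan = ["open menu", "click save"]
    targets = {"open menu": (10, 10), "click save": (40, 80)}
    agent = ModularAgent(
        planner=lambda goal, mem: plan[len(mem)] if len(mem) < len(plan) else "",
        grounder=lambda step, screen: targets[step],
        executor=lambda xy: f"screen after click at {xy}",
        verifier=lambda step, screen: str(targets[step]) in screen,
    )
    return agent.run("save the file", "initial screen")
```

Because each module is a plain callable, a better grounder or verifier can be swapped in without touching the loop, which is the property the modular pattern relies on.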

Perception

Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.

Action grounding

Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
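
To make the difference concrete, here is a toy comparison of the two styles. The `click` tool, the JSON shape, and both runner functions are assumptions made for illustration, not any vendor's API:

```python
import json
from typing import Callable, Dict, List, Tuple

ActionLog = List[Tuple[str, int, int]]


def click(log: ActionLog, x: int, y: int) -> None:
    """Stand-in for a real mouse click; it only records the request."""
    log.append(("click", x, y))


TOOLS: Dict[str, Callable[..., None]] = {"click": click}


def run_json_call(call_json: str, log: ActionLog) -> None:
    """JSON tool-calling: one model turn yields exactly one action."""
    call = json.loads(call_json)
    TOOLS[call["name"]](log, **call["arguments"])


def run_code_action(source: str, log: ActionLog) -> None:
    """Code-as-action: one model turn yields a snippet that may loop or branch."""
    namespace = {
        # Trimming builtins is NOT a security boundary; real systems run
        # model-written code inside a sandbox (container or microVM).
        "__builtins__": {"range": range},
        "click": lambda x, y: click(log, x, y),
    }
    exec(source, namespace)


# Ticking three checkboxes: three JSON turns versus one code action.
json_log: ActionLog = []
for y in (100, 140, 180):
    run_json_call(json.dumps({"name": "click", "arguments": {"x": 20, "y": y}}), json_log)

code_log: ActionLog = []
run_code_action("for y in range(100, 220, 40):\n    click(20, y)", code_log)
```

Both logs end up identical, but the code action needed a single model turn, which is one plausible reason the approach does better on complex tasks.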

Planning

ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
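
The DAG idea can be sketched with `asyncio`: steps whose dependencies are satisfied run concurrently, wave by wave. This is a simplified stand-in rather than LLMCompiler's actual scheduler, and the task names below are hypothetical:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Task = Callable[[Dict[str, str]], Awaitable[str]]


async def run_dag(tasks: Dict[str, Task], deps: Dict[str, List[str]]) -> Dict[str, str]:
    """Run tasks in dependency order; independent tasks share a wave and run together."""
    results: Dict[str, str] = {}
    pending = set(tasks)
    while pending:
        ready = sorted(t for t in pending if all(d in results for d in deps.get(t, [])))
        if not ready:
            raise ValueError("cycle or unknown dependency in plan")
        outputs = await asyncio.gather(*(tasks[t](results) for t in ready))
        results.update(zip(ready, outputs))
        pending.difference_update(ready)
    return results


async def fetch_price(_: Dict[str, str]) -> str:
    await asyncio.sleep(0)  # placeholder for a slow browser or API call
    return "price=10"


async def fetch_stock(_: Dict[str, str]) -> str:
    await asyncio.sleep(0)
    return "stock=3"


async def summarize(done: Dict[str, str]) -> str:
    return done["price"] + "," + done["stock"]


plan_results = asyncio.run(
    run_dag(
        {"price": fetch_price, "stock": fetch_stock, "summary": summarize},
        {"summary": ["price", "stock"]},
    )
)
```

Here `price` and `stock` run in the same wave and `summary` waits for both. Any speedup over a strictly sequential ReAct loop depends on how many steps are truly independent.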

Sandboxing

Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), E2B on Firecracker microVMs.

Error recovery

Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
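
A minimal sketch of the first two techniques combined, assuming screenshots arrive as grids of grayscale pixel values. The `click` and `capture` callables, the offsets, and the 1% change threshold are illustrative choices, not values from any of the tools named here:

```python
from typing import Callable, Optional, Sequence, Tuple

Screenshot = Sequence[Sequence[int]]  # rows of grayscale pixel values
Point = Tuple[int, int]


def changed_fraction(before: Screenshot, after: Screenshot) -> float:
    """Fraction of pixels that differ between two same-sized screenshots."""
    total = changed = 0
    for row_before, row_after in zip(before, after):
        for pixel_before, pixel_after in zip(row_before, row_after):
            total += 1
            changed += pixel_before != pixel_after
    return changed / total if total else 0.0


def click_with_retry(
    click: Callable[[Point], None],
    capture: Callable[[], Screenshot],
    target: Point,
    offsets: Sequence[Point] = ((0, 0), (4, 0), (0, 4), (-4, 0), (0, -4)),
    min_change: float = 0.01,
) -> Optional[Point]:
    """Click, diff the screen, and retry at a nearby offset if nothing changed."""
    for dx, dy in offsets:
        before = capture()
        attempt = (target[0] + dx, target[1] + dy)
        click(attempt)
        if changed_fraction(before, capture()) >= min_change:
            return attempt  # the UI reacted, so treat the click as landed
    return None  # every variation failed; escalate to the planner
```

A pixel diff only shows that something changed, not that the right thing changed, so a real harness would usually pair it with a verifier step.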

UI grounding

Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
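
One way to picture cascaded search is a coarse-to-fine zoom: score a few sub-regions, keep the best, and repeat until the region is small enough to click. The sketch below is a generic quadrant search, not ScreenSeekeR's algorithm, and the `score` callable stands in for a hypothetical vision-model call:

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def cascaded_ground(
    score: Callable[[Box], float],
    screen: Box,
    min_size: int = 32,
) -> Tuple[int, int]:
    """Zoom into the best-scoring quadrant until the region is small, then click its center."""
    left, top, right, bottom = screen
    while (right - left) > min_size or (bottom - top) > min_size:
        mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
        quadrants = [
            (left, top, mid_x, mid_y),
            (mid_x, top, right, mid_y),
            (left, mid_y, mid_x, bottom),
            (mid_x, mid_y, right, bottom),
        ]
        left, top, right, bottom = max(quadrants, key=score)
    return ((left + right) // 2, (top + bottom) // 2)


def make_oracle(target_x: int, target_y: int) -> Callable[[Box], float]:
    """Toy scorer: 1.0 if the box contains the target point, else 0.0."""
    def score(box: Box) -> float:
        left, top, right, bottom = box
        return 1.0 if left <= target_x < right and top <= target_y < bottom else 0.0
    return score


click_x, click_y = cascaded_ground(make_oracle(1500, 300), (0, 0, 1920, 1080))
```

With a perfect scorer the result lands within `min_size` pixels of the target. With a real model, one wrong choice early in the cascade discards the correct region for good, which is part of why grounding stays hard.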

Still unsolved — the safety ceiling

Prompt injection: OpenAI admitted in December 2025 that it “may never be fully solved.” A joint OpenAI/Anthropic/DeepMind study found a >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 showed one-click RCE via a malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
Captcha + TOS: Cloudflare, Arkose, and hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking sites, and banks actively block agents. The legal status of automated access remains unresolved.
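
The economics and speed figures above reduce to simple arithmetic. A small sketch that works them through, using the $2.25/ACU Devin Core price from the leaderboard; the 50-action task length is a hypothetical example, not a figure from this page:

```python
def cost_per_extraction(daily_llm_cost_usd: float, extractions_per_day: int) -> float:
    """LLM fee attributable to a single extraction."""
    return daily_llm_cost_usd / extractions_per_day


def implied_acus_per_hour(hourly_cost_usd: float, usd_per_acu: float) -> float:
    """How many ACUs an hour of agent time consumes at a given hourly cost."""
    return hourly_cost_usd / usd_per_acu


def task_latency_seconds(actions: int, seconds_per_action: float) -> float:
    """Wall-clock time for a task when actions run strictly one after another."""
    return actions * seconds_per_action


# 10,000 Stagehand extractions/day at $50 to $200/day in LLM fees.
low = cost_per_extraction(50, 10_000)    # 0.005 USD each
high = cost_per_extraction(200, 10_000)  # 0.02 USD each

# Devin Core: about $9 per hour at $2.25 per ACU.
acus = implied_acus_per_hour(9.0, 2.25)  # 4.0 ACUs per hour

# Hypothetical 50-action task at 2 to 5 seconds per action.
fastest = task_latency_seconds(50, 2.0)  # 100.0 seconds
slowest = task_latency_seconds(50, 5.0)  # 250.0 seconds
```

Half a cent to two cents per extraction looks small in isolation; the daily total is where it diverges from a deterministic Playwright script, whose LLM cost is zero.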

Frequently asked

Q1. What is a Computer Use agent?

A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).

Q2. What is the current OSWorld SOTA in 2026?

As of April 2026, the OSWorld-Verified leaderboard is led by Holo3-35B-A3B from H Company at 80.4%. Holo3 was the first model to cleanly beat the 72.4% human-expert baseline on the verified split. Kimi K2.6 (Moonshot AI) is second at 73.1% as a general-purpose model, and Claude Sonnet 4.6 is third at 72.1% — effectively tied with the human baseline.

Q3. Is OSWorld still the right benchmark, or is it outdated?

OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and cleaned evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.0) that weighted agentic leaderboards use. That said, it's only one dimension — enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.

Q4. How is OSWorld-Verified different from the original OSWorld?

OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.

Q5. What other benchmarks exist besides OSWorld?

The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.0 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.

Q6. Can any agent beat a human on these benchmarks?

On OSWorld-Verified: yes, two models now cleanly beat the 72.4% human baseline (Holo3-35B-A3B at 80.4%, Kimi K2.6 at 73.1%). On WebVoyager: yes, Surfer 2 at 97.1% pass@1 is above expected human accuracy. On BrowseComp: yes, Claude Mythos Preview scores 86.9%, while humans with internet access score ~80%. On SWE-Bench Verified: top models pass 87%+ of real GitHub issues. On AndroidWorld: not yet, models trail the ~80% human baseline and the current SOTA is UI-TARS-2 at 75.8%. On GDPval: no, agents lose blind expert pairwise comparisons the majority of the time.

Q7. What's the safety picture in 2026?

Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.

Q8. Which agents are open-source?

The strongest open-source options in 2026: OpenHands (All Hands AI, sandbox+coding), Browser Use (Python library driving Chromium), Magnetic-One (Microsoft Research multi-agent), UI-TARS-2 (ByteDance, 53.1% OSWorld-Verified), Kimi K2.5/K2.6 (Moonshot, 63.3% / 73.1% OSWorld-Verified), Holo3 predecessor models (H Company research releases), GUI-Owl-1.5 32B (Alibaba, 55.4% OSWorld-Verified), Playwright MCP (Microsoft), Chrome DevTools MCP (Google). For the full list, toggle 'Open-source only' on the leaderboard above.

Editor's take — April 23, 2026

2026 is the year computer use stopped being a demo and started being a line item. Winners today: H Company on OSWorld-Verified (Holo3-35B-A3B, 80.4%), Anthropic on Terminal-Bench 2.0, OpenAI on browsing (BrowseComp + Mind2Web 2), H Company's Surfer 2 on WebVoyager (97.1%), OpenHands for open-source coding, Claude Code for terminal autonomy, Microsoft Copilot Studio for enterprise distribution. Still to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source stack has caught up with and in some cases surpassed proprietary offerings: Kimi K2.6 (73.1% OSWorld-V) is the first open-source model to beat the human baseline.

Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed April 23, 2026.