Agents that can actually drive your computer.
Twelve months ago no AI could operate a desktop better than a drunk intern: OSWorld scores sat under 15%. Today Holo3-35B-A3B tops OSWorld-Verified at 80.4%, clearing the 72.4% human-expert baseline. Two models now beat humans on OSWorld, three on WebVoyager, and a dozen on SWE-Bench Verified. The big story isn't one breakthrough; it's everyone shipping at once, with the benchmarks themselves getting rewritten mid-flight.
This page tracks 55 agents across 4 architectural categories, scored on 19 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.
The 4 types of computer use
Click a card to filter the leaderboard below.
Every Computer Use agent, by type
55 tracked agents across 4 types. Ranked within each category by peak benchmark.
Screen-level OS control
15 agents

| Agent | Maker | Launch | OSWorld | WebVoyager | SWE-Bench | GAIA | Pricing |
|---|---|---|---|---|---|---|---|
| **Claude Sonnet 4.6**: multimodal LLM released February 25, 2026 | Anthropic | 2026-02 | 72.1 | — | — | — | API: $3/$15 per M tokens |
| **Kimi K2.5** (OSS): open-source 1T-parameter multimodal model with vision capabilities and Agent Swarm | Moonshot AI | 2026-01 | 63.3 | — | — | — | API pay-as-you-go |
| **Claude Computer Use**: the first public screen-level OS-control API; powered by Claude 3.5 → 4.x → Opus 4.7; takes screenshots and identifies UI elements | Anthropic | 2024-10 | 72.1 | — | — | — | Claude API: input $5/M, output $25/M |
| **Kimi K2.6** (OSS): 1T-param MoE (32B active) built for long-horizon agentic coding (up to 13h continuous), with agent swarm scaling to 300 sub-agents | Moonshot AI | 2026-04 | 73.1 | — | — | — | API: $0.60/$2.75 per M tokens |
| **Holo3-35B-A3B**: H Company's specialized OSWorld agent, currently #1 on OSWorld-Verified (2 runs, 100 steps); first model to beat the 72.4% human baseline | H Company | 2026-04 | 80.4 | — | — | — | H Company enterprise |
| **Claude Sonnet 4.5**: September 2025 release; a benchmark milestone showing rapid improvement over prior Claude generations | Anthropic | 2025-09 | 62.9 | — | — | — | Legacy Anthropic API |
| **Seed 1.8**: ByteDance Seed's general model; ByteDance has invested heavily in computer use via the UI-TARS + Seed lineage | ByteDance Seed | 2025-12 | 61.9 | — | — | — | Doubao ecosystem |
| Meituan LongCat Team's specialized CUA research model; 56.7% at max 50 steps is strong for a step-constrained evaluation | Meituan LongCat | 2026-01 | 56.7 | — | — | — | Research |
| **GUI-Owl-1.5 32B** (OSS): Alibaba Tongyi Lab Mobile-Agent Team model; open weights, Chinese research group | Alibaba Tongyi Lab | 2026-03 | 55.4 | — | — | — | Free (OSS) |
| **DeepMiner-Mano**: Mininglamp's specialized 72B GUI agent; max 100 steps | Mininglamp Technology | 2025-10 | 53.9 | — | — | — | Research |
| **UI-TARS-2**: ByteDance Seed's second-generation specialized GUI agent; max 100 steps | ByteDance Seed | 2025-10 | 53.1 | — | — | — | ByteDance ecosystem |
| **OpenCUA-72B** (OSS): University of Hong Kong × Moonshot AI joint open-source CUA model (3-run average, 100 steps); most-cited open-weight CUA | HKU & Moonshot AI | 2025-10 | 45.0 | — | — | — | Free (OSS) |
| **Operator/CUA**: OpenAI's specialized model as benchmarked (max 50 steps); now folded into ChatGPT Agent | OpenAI | 2025-01 | 31.3 | — | — | — | ChatGPT Pro |
| **Claude computer-use preview**: Anthropic's original October 2024 preview (max 50 steps); proved the paradigm | Anthropic | 2024-10 | 31.3 | — | — | — | Anthropic API |
| **Copilot Studio computer use**: enterprise computer-using agents, GA April 2026; choice of Claude Sonnet 4.5 or OpenAI CUA backends; built-in credentials vault and Purview auditing | Microsoft | 2026-04 | — | — | — | — | Copilot Studio subscription |
Browser-only
19 agents

| Agent | Maker | Launch | OSWorld | WebVoyager | SWE-Bench | GAIA | Pricing |
|---|---|---|---|---|---|---|---|
| **Surfer 2**: H Company's proprietary enterprise agent; current WebVoyager SOTA; top of the Steel.dev public leaderboard | H Company | 2026-02 | — | 97.1 | — | — | Enterprise contract |
| **Browser Use** (OSS): open-source Python library (50k+ GitHub stars); multi-tab, memory, parallel agents; BYO-LLM; 89.1% WebVoyager by independent eval | Browser Use (OSS) | 2024-10 | — | 89.1 | — | — | Free (OSS) |
| **Magnitude**: independent browser agent focused on test automation; #2 on WebVoyager, behind only Surfer 2 | Magnitude | 2025-06 | — | 93.9 | — | — | Contact sales |
| **ChatGPT Agent**: OpenAI's unified agent (absorbed Operator January 2026); runs Chrome via a vision-action loop with GPT-5-class reasoning | OpenAI | 2026-01 | 87.0 | — | — | — | ChatGPT Pro $200/mo; Plus $20/mo (waitlisted) |
| **Skyvern** (OSS): open-source browser agent; best-in-class on 'WRITE' form-filling tasks; workflow chaining; strong RPA replacement positioning | Skyvern (Y Combinator) | 2024-04 | — | 85.8 | — | — | Open-source; enterprise contract |
| **Stagehand** (OSS): MIT-licensed SDK; v3 rewrite (Feb 2026) uses Chrome DevTools Protocol directly, 44% faster than v2; AI-native caching + self-healing DOM | Browserbase | 2024-06 | — | 85.8 | — | — | Free SDK + Browserbase cloud ($99/mo+) |
| Chrome-integrated agent powered by Gemini 2.0 → 3.x; runs 10 concurrent VM tasks; available via Google AI Ultra subscription | Google DeepMind | 2024-12 | — | 83.5 | — | — | Google AI Ultra subscription |
| **Playwright MCP**: Microsoft's official MCP server wrapping Playwright; exposes Chromium/Firefox/WebKit as MCP tools to any MCP client (Claude Code, Cursor, and others) | Microsoft | 2025-03 | — | — | — | — | Free (Apache-2.0) |
| **Multi-On**: once a top-3 browser agent; notably absent from 2026 leaderboards, overtaken by Surfer 2, Magnitude, and AIME | Multi-On | 2023-11 | — | — | — | — | Subscription |
| **Chrome DevTools MCP**: Chrome team's official MCP server exposing the DevTools Protocol to AI agents; used by Claude Code + Cursor for browser automation + perf work | Google | 2025-09 | — | — | — | — | Free (Apache-2.0) |
| Popular community MCP server that connects any MCP-compatible AI agent to a live Chrome browser via extension + local bridge | hangwin | 2025-06 | — | — | — | — | Free (MIT) |
| **Claude for Chrome**: Anthropic's Chrome extension; research preview August 2025, expanded to Max users November 2025 and all paid users December 2025 | Anthropic | 2025-08 | — | — | — | — | Claude Pro ($20) / Max ($100+) |
| **ChatGPT Atlas**: OpenAI's Chromium-based AI-native browser; macOS at launch, rolling to Windows/iOS/Android; persistent ChatGPT sidebar + Agent Mode | OpenAI | 2025-10 | — | — | — | — | Free sidebar; Agent Mode via ChatGPT Plus/Pro |
| **Comet**: Perplexity's AI-native browser; started as Max-only ($200/mo), went free globally October 2, 2025; assistant can execute multi-step workflows | Perplexity | 2025-07 | — | — | — | — | Free · Comet Plus $5 · Pro $20 · Max $200 |
| **Dia**: The Browser Company's AI-first browser, successor to Arc (sunset May 2025); acquired by Atlassian in 2025; persistent AI command bar + skills | The Browser Company | 2025-06 | — | — | — | — | Free · Dia Pro $20/mo |
| **Fellou**: Silicon Valley agentic browser; Fellou CE launched September 2025 as the 'world's first spatial agentic browser' | Fellou Inc | 2025-09 | — | — | — | — | Freemium + paid |
| **Opera Neon**: Opera's agentic browser (2025 version, not to be confused with the 2017 Neon); invite-only September 2025, public December 2025 | Opera | 2025-12 | — | — | — | — | $19.90/mo |
| **Leo**: Brave's built-in AI; agentic 'AI Browsing' mode shipped December 2025 (Nightly 1.86+); privacy-first (no account for freemium tier); Chromium-based | Brave | 2023-11 | — | — | — | — | Free · Leo Premium $15/mo |
| **Browserbase**: managed headless-browser infrastructure ('AWS for headless browsers'); $40M Series B June 2025 at a $300M valuation; residential IPs + CAPTCHA handling | Browserbase | 2024-06 | — | — | — | — | Usage-based, $99/mo+ |
Sandboxed VM / container
16 agents

| Agent | Maker | Launch | OSWorld | WebVoyager | SWE-Bench | GAIA | Pricing |
|---|---|---|---|---|---|---|---|
| Enterprise agent from Writer (content platform); GAIA Level 3 leader, surpassing Manus mid-2025 | Writer | 2025-09 | — | — | — | 61.0 | Enterprise contract |
| **Manus**: Chinese multi-agent system (Executor + Planner + Knowledge) with 29 tools and a per-session isolated Linux sandbox; desktop launched March 2026 | Butterfly Effect | 2025-03 | — | — | — | 57.7 | Subscription (invite-only at launch) |
| **OpenHands** (OSS): MIT-licensed open-source agent (formerly OpenDevin); Docker-based sandbox, BYO-LLM; 53%+ on SWE-bench Verified with Claude; a consistent top-3 performer | All-Hands AI | 2024-03 | — | — | 53.0 | — | Free (self-hosted) or All-Hands Cloud |
| **Lovable**: AI full-stack app builder (formerly GPT Engineer); viral in 2025; Supabase integration out of the box; built for non-technical founders | Lovable | 2024-11 | — | — | — | — | Free + Pro $25/mo |
| **Replit Agent**: cloud development agent with parallel tasks, branching, and sub-agent spawning; effort-based pricing; mix of Claude/GPT/Gemini backends | Replit | 2026-03 | — | — | — | — | Core $25/mo + credits; Pro $100/mo + credits |
| **E2B** (OSS): open-source secure cloud runtime for AI agents; Firecracker microVMs; Python/JS SDKs + custom templates; the sandbox backend for many agent stacks | E2B | 2023-11 | — | — | — | — | Free tier + usage-based |
| **Daytona** (OSS): open-source ephemeral dev environments, pivoted to an AI-agent runtime in 2025; self-hostable; used for per-session Linux sandboxes | Daytona | 2024-05 | — | — | — | — | Free (OSS) + cloud paid |
| **Modal**: serverless sandbox runtime for agent code execution; sub-second cold starts; popular execution backend for research agents | Modal Labs | 2024-09 | — | — | — | — | $0.00003942/CPU-sec |
| AWS's managed agent platform (GA October 2025, preview July 2025); includes Browser tool, Code Interpreter, Gateway, Memory; any model via Bedrock | Amazon | 2025-10 | — | — | — | — | AWS usage-based |
| Google's enterprise agent platform, rebranded from Agentspace at Cloud Next 2026; managed agents + ADK (Agent Development Kit, open source) | Google | 2024-12 | — | — | — | — | Enterprise contract |
| **Agentforce**: Salesforce's enterprise agent platform; v1 September 2024, Agentforce 360 for AWS early 2026; deep CRM integration, guardrails, per-conversation pricing | Salesforce | 2024-09 | — | — | — | — | Per-conversation |
| IBM's enterprise agent platform with 150+ pre-built agents in the Agent Catalog; Granite models under the hood; targets regulated industries | IBM | 2025-05 | — | — | — | — | Enterprise contract |
| Enterprise AI agents baked into ServiceNow workflows; IBM Granite integration from May 2024; ITSM + HR + customer-service automation | ServiceNow | 2023-09 | — | — | — | — | Enterprise contract |
| **Bolt**: browser-based AI full-stack builder; runs npm install + code in WebContainer directly in the browser; Bolt V2 shipped 2025 | StackBlitz | 2024-10 | — | — | — | — | Credits-based, $20/mo Pro |
| Google's browser-based AI development environment (rebrand of Project IDX); Firebase + Gemini + AI Studio integration | Google | 2025-04 | — | — | — | — | Free tier + usage |
| **Emergent**: full-stack agent with live preview sandbox; built-in hosting + auth + database; aggressive pricing vs Cursor/Replit/Bolt | Emergent | 2025-01 | — | — | — | — | Free + paid tiers |
Coding-focused
5 agents

| Agent | Maker | Launch | OSWorld | WebVoyager | SWE-Bench | GAIA | Pricing |
|---|---|---|---|---|---|---|---|
| **SWE-Agent** (OSS): open-source research agent (NeurIPS 2024); Mini-SWE-Agent scores >74% on SWE-bench Verified in 100 lines of Python, no tool-calling needed | Princeton + Stanford | 2024-04 | — | — | 74.0 | — | Free (OSS) |
| **Devin**: the original 'AI software engineer'; dropped from $500/mo to a $20/mo Core tier in April 2025; in-house SWE-1.5 model scores 40.08% on SWE-bench Pro | Cognition | 2024-03 | — | — | — | — | Core $20/mo + $2.25/ACU; Team $500/mo |
| **Aider** (OSS): terminal-first AI pair programmer; Git-integrated; batch editing; BYO-LLM; popular in the local-LLM community | Aider (OSS) | 2023-05 | — | — | — | — | Free (OSS) |
| **Cursor Agent**: IDE-first agent inside Cursor; adds 16pp over the raw model via its scaffold; reports 70% on CursorBench with Opus 4.7; doesn't publish SWE-bench Verified numbers | Anysphere | 2025-06 | — | — | — | — | $20/mo Pro |
| **v0**: Vercel's generative UI + coding agent; creates React/Next.js from natural language; v0 Agent connects to existing repos | Vercel | 2023-10 | — | — | — | — | Credits, $20/mo Pro |
Scores from official publications + independent leaderboards (Steel.dev, OSWorld, SWE-Bench). Dash = not published.
How the benchmarks evolved — 2023 → 2026
The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.
- 2023 · WebArena + Mind2Web
First generation. DOM-only. Fully gamed by 2025.
- Apr 2024 · OSWorld (v1)
XLANG Lab's first real-desktop VM benchmark. 369 tasks.
- Jun 2024 · WebVoyager + GAIA
Web agents + general reasoning. GPT-4V as judge (later criticized).
- Dec 2024 · TheAgentCompany
CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.
- Apr 2025 · BrowseComp
OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.
- Jul 2025 · OSWorld-Verified
XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.
- Sep 2025 · GDPval
OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.
- Oct 2025 · SWE-Bench Pro + Terminal-Bench 2.0
Held-out, contamination-resistant successors to the gamed originals.
- 2026 · Core triad consolidates
OSWorld-Verified + BrowseComp + Terminal-Bench 2.0 = weighted agentic score.
Every benchmark that matters — what each one actually measures
A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.
How they actually work
The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
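The modular split can be sketched as a plain control loop. Every class and method name below is illustrative scaffolding, not Agent S2's actual API; the planner, grounder, and verifier are canned stand-ins for what would be model calls in a real agent.

```python
from dataclasses import dataclass

@dataclass
class Step:
    intent: str       # e.g. "click the Submit button"
    done_check: str   # how the Verifier decides the step succeeded

class Planner:
    def plan(self, goal: str) -> list[Step]:
        # A real planner calls an LLM; here we return a canned decomposition.
        return [Step("open form", "form visible"),
                Step("submit form", "confirmation shown")]

class Grounder:
    def ground(self, step: Step) -> tuple[int, int]:
        # A real grounder maps the intent to screen coordinates from a screenshot.
        return (640, 360)  # toy fixed coordinates

class Executor:
    def execute(self, coords: tuple[int, int]) -> str:
        return f"clicked {coords}"

class Verifier:
    def ok(self, step: Step, observation: str) -> bool:
        return "clicked" in observation  # toy success check

def run(goal: str) -> list[str]:
    memory: list[str] = []  # Memory: trajectory log fed back into planning
    planner, grounder, executor, verifier = Planner(), Grounder(), Executor(), Verifier()
    for step in planner.plan(goal):
        obs = executor.execute(grounder.ground(step))
        if verifier.ok(step, obs):
            memory.append(f"{step.intent}: {obs}")
        else:
            memory.append(f"{step.intent}: FAILED, replanning needed")
    return memory

print(run("submit the signup form"))
```

The point of the split is that each module can fail, be swapped, or be benchmarked independently, which is exactly what lets a small specialized grounder beat a monolithic frontier model on long tasks.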
Perception
Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.
Action grounding
Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
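The contrast between JSON tool-calling and CodeAct fits in a few lines: JSON-style emits one action per model turn, while CodeAct lets the model emit a small program that composes many actions in a single turn. The `click`/`type_text` helpers are hypothetical stand-ins for an agent's tool layer.

```python
actions_log = []

def click(selector: str):
    actions_log.append(("click", selector))

def type_text(selector: str, text: str):
    actions_log.append(("type", selector, text))

# JSON-style: the orchestrator executes exactly one call per LLM round-trip.
json_tool_call = {"name": "click", "arguments": {"selector": "#login"}}
globals()[json_tool_call["name"]](**json_tool_call["arguments"])

# CodeAct-style: the LLM emits code; one round-trip covers a loop + a final click.
code_from_llm = """
for field, value in [("#user", "alice"), ("#pass", "hunter2")]:
    type_text(field, value)
click("#submit")
"""
exec(code_from_llm)  # a real system runs this inside a sandbox, not on the host

print(len(actions_log))  # 4 actions from 2 "model turns"
```

Fewer round-trips is the whole advantage: on complex tasks the latency and error-accumulation savings compound.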
Planning
ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
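A stripped-down ReAct loop looks like this, with a canned function standing in for the LLM and a single toy `lookup` tool; the Thought/Action/Observation text format follows the ReAct convention, but the tool table and answers are invented for illustration.

```python
def fake_llm(history: str) -> str:
    # Returns the next Thought/Action pair; a real agent calls a chat model here.
    if "Observation: 42" in history:
        return "Thought: I have the answer.\nAction: finish[42]"
    return "Thought: I should look it up.\nAction: lookup[meaning of life]"

def lookup(query: str) -> str:
    return "42"  # toy tool

def react(question: str, max_turns: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_turns):
        step = fake_llm(history)
        action = step.split("Action: ")[1]
        if action.startswith("finish["):
            return action[len("finish["):-1]
        tool, arg = action.split("[", 1)
        obs = {"lookup": lookup}[tool](arg[:-1])   # dispatch to the named tool
        history += f"\n{step}\nObservation: {obs}"  # feed the result back in
    return "gave up"

print(react("meaning of life"))  # -> 42
```

LLMCompiler's DAG scheduling and Reflexion's self-critique both bolt onto this same loop: the former parallelizes independent Actions, the latter appends a critique of a failed trajectory to the history before retrying.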
Sandboxing
Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.
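A minimal per-session Docker sandbox in the spirit described above can be built from the plain CLI; the image choice and hardening flags below are one plausible baseline, not any vendor's configuration, and running it requires a local Docker daemon.

```python
import subprocess
import uuid

def sandbox_cmd(code: str, name: str) -> list[str]:
    # Build the docker invocation: throwaway container, no network
    # (limits exfiltration / prompt-injection blast radius), capped resources.
    return ["docker", "run", "--rm", "--name", name,
            "--network=none", "--memory=512m", "--cpus=1",
            "python:3.12-slim", "python", "-c", code]

def run_in_fresh_sandbox(code: str) -> str:
    # Each agent session gets its own uniquely named, auto-removed container.
    name = f"agent-session-{uuid.uuid4().hex[:8]}"
    out = subprocess.run(sandbox_cmd(code, name),
                         capture_output=True, text=True, timeout=60)
    return out.stdout

cmd = sandbox_cmd("print(2 + 2)", "demo")
print(cmd[0], "--network=none" in cmd)

# run_in_fresh_sandbox("print(2 + 2)")  # uncomment with Docker installed
```

Firecracker microVMs (E2B) trade slightly slower startup for a stronger isolation boundary than a shared-kernel container; the per-session pattern is the same either way.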
Error recovery
Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
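Screenshot diffing plus retry-with-variation reduces to a small loop: hash the screen before and after an action, and jitter the target if nothing changed. The screen and click functions below are toy stand-ins for real capture/input APIs; only the recovery loop itself is the point.

```python
import hashlib

screen_state = {"value": "login-page"}

def screenshot_hash() -> str:
    # Stand-in for hashing a real screenshot's pixels.
    return hashlib.sha256(screen_state["value"].encode()).hexdigest()

def click(x: int, y: int):
    # Toy UI: only a click within 5px of (100, 200) actually hits the button.
    if abs(x - 100) <= 5 and abs(y - 200) <= 5:
        screen_state["value"] = "dashboard"

def click_with_recovery(x: int, y: int, retries: int = 3) -> bool:
    for attempt in range(retries):
        before = screenshot_hash()
        click(x + attempt * 3, y)        # vary the target slightly each retry
        if screenshot_hash() != before:  # screen changed => action took effect
            return True
    return False

print(click_with_recovery(94, 200))  # x=94 misses, the x=97 retry hits
```

Stagehand's self-healing DOM layer applies the same idea one level up: when a cached selector stops matching, it re-resolves the element from the instruction instead of failing.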
UI grounding
Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
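The element-matching half of grounding can be illustrated with a toy scorer over candidate UI elements; the element list and the word-overlap heuristic are invented for illustration and bear no relation to ScreenSeekeR's actual method, which searches the raw pixels.

```python
# Candidate elements as a detector might emit them: text, color, bounding box.
elements = [
    {"text": "Buy now",  "color": "blue",  "bbox": (820, 540, 940, 580)},
    {"text": "Cancel",   "color": "gray",  "bbox": (700, 540, 800, 580)},
    {"text": "Buy gift", "color": "green", "bbox": (60, 80, 180, 120)},
]

def score(instruction: str, el: dict) -> int:
    # Toy heuristic: count instruction words matching the element text or color.
    words = instruction.lower().split()
    return sum(w in el["text"].lower() or w == el["color"] for w in words)

def ground(instruction: str) -> tuple[int, int]:
    best = max(elements, key=lambda el: score(instruction, el))
    x0, y0, x1, y1 = best["bbox"]
    return ((x0 + x1) // 2, (y0 + y1) // 2)  # click the element's center

print(ground("click the blue Buy button"))  # center of the "Buy now" box
```

The hard part that this toy skips is producing the candidate list at all: on dense professional UIs (ScreenSpot-Pro), detection and disambiguation, not center-picking, are where agents fail.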
Still unsolved — the safety ceiling
- Prompt injection: OpenAI admitted December 2025 it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
- Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
- Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
- Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
- Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
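The cost math in the economics bullet above reduces to simple arithmetic; the per-extraction prices here are assumed figures chosen to reproduce the ranges quoted in the text, not measured rates.

```python
# LLM-driven extraction vs. a deterministic Playwright script.
extractions_per_day = 10_000
cost_per_llm_extraction = (0.005, 0.02)  # assumed $ range per vision+reasoning call

low, high = (extractions_per_day * c for c in cost_per_llm_extraction)
print(f"LLM-driven: ${low:.0f}-${high:.0f}/day; deterministic script: ~$0/day")

# Devin-style ACU math from the text: Core is $2.25/ACU, and ~$9/hour
# implies roughly 4 ACUs consumed per agent-hour.
print(f"1 agent-hour ~= ${4 * 2.25:.2f}")
```

The asymmetry is the business case in miniature: agents win where the task is novel or brittle, deterministic automation wins anywhere the workflow is stable enough to script once.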
Frequently asked
Q1. What is a Computer Use agent?
A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).
Q2. What is the current OSWorld SOTA in 2026?
As of April 2026, the OSWorld-Verified leaderboard is led by Holo3-35B-A3B from H Company at 80.4%. Holo3 was the first model to cleanly beat the 72.4% human-expert baseline on the verified split. Kimi K2.6 (Moonshot AI) is second at 73.1% as a general-purpose model, and Claude Sonnet 4.6 is third at 72.1% — effectively tied with the human baseline.
Q3. Is OSWorld still the right benchmark, or is it outdated?
OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and cleaned evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.0) that weighted agentic leaderboards use. That said, it's only one dimension — enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.
Q4. How is OSWorld-Verified different from the original OSWorld?
OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.
Q5. What other benchmarks exist besides OSWorld?
The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.0 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.
Q6. Can any agent beat a human on these benchmarks?
On OSWorld-Verified: yes, two models now cleanly beat the 72.4% human baseline (Holo3-35B-A3B at 80.4%, Kimi K2.6 at 73.1%). On WebVoyager: yes, Surfer 2 at 97.1% pass@1 is above expected human accuracy. On BrowseComp: yes, Claude Mythos Preview scores 86.9% versus ~80% for humans with internet access. On SWE-Bench Verified: top models pass 87%+ of real GitHub issues. On AndroidWorld: not yet; models trail the ~80% human baseline, with current SOTA UI-TARS-2 at 75.8%. On GDPval: no; agents lose blind expert pairwise comparisons the majority of the time.
Q7. What's the safety picture in 2026?
Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.
Q8. Which agents are open-source?
The strongest open-source options in 2026: OpenHands (All Hands AI, sandbox+coding), Browser Use (Python library driving Chromium), Magnetic-One (Microsoft Research multi-agent), UI-TARS-2 (ByteDance, 53.1% OSWorld-Verified), Kimi K2.5/K2.6 (Moonshot, 63.3% / 73.1% OSWorld-Verified), Holo3 predecessor models (H Company research releases), GUI-Owl-1.5 32B (Alibaba, 55.4% OSWorld-Verified), Playwright MCP (Microsoft), Chrome DevTools MCP (Google). For the full list, toggle 'Open-source only' on the leaderboard above.
Editor's take — April 23, 2026
2026 is the year computer use stopped being a demo and started being a line item. Winners today: H Company on OSWorld-Verified (Holo3-35B-A3B, 80.4%), Anthropic on Terminal-Bench 2.0, OpenAI on browsing (BrowseComp + Mind2Web 2), H Company's Surfer 2 on WebVoyager (97.1%), OpenHands for open-source coding, Claude Code for terminal autonomy, Microsoft Copilot Studio for enterprise distribution. Still on the sidelines: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness — scaffold + sandbox + verifier + recovery — matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source stack has caught and in some cases surpassed proprietary offerings: Kimi K2.6 (73.1% OSWorld-V) is the first open-source model to beat the human baseline.
Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed April 23, 2026.