Autonomous agents that control computers — OSWorld, BrowseComp & Terminal-Bench.
Current OSWorld-Verified SOTA: Holo3-35B-A3B from H Company at 80.4% (April 2026), the first model to cleanly beat the 72.4% human-expert baseline. Strongest open-source: Kimi K2.6 (Moonshot AI) at 73.1%. Best coding agent: Claude Opus 4.7 (SWE-Bench Pro 64.3%, SWE-Bench Verified 87.6%). Best browser agent: Surfer 2 (WebVoyager 97.1%, H Company). This page tracks 7 agents across 8 verified benchmarks.
Agents that can actually drive your computer.
Twelve months ago no AI could operate a desktop better than a drunk intern: OSWorld scores sat under 15%. Today Holo3-35B-A3B leads OSWorld-Verified at 80.4% and Kimi K2.6 follows at 73.1%, both clearing the 72.4% human-expert baseline. Two models now beat humans on OSWorld, three on WebVoyager, a dozen on SWE-Bench Verified. The big story isn't one breakthrough; it's everyone shipping at once.
This page tracks 7 agents across 4 architectural categories, scored on 8 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.
The 4 types of computer use
Screen-level OS control
2 agents
| 1 | Kimi K2.6 (OSS): Moonshot AI's 1T-param MoE (32B active) built for long-horizon agentic coding (up to 13h continuous) with agent swarm scaling to 300 sub-agents. Leads SWE-Bench | Moonshot AI | 2026-04 | 73.1 | — | — | — |
| 2 | Claude Sonnet 4.5 (Sept 2025 release) on OSWorld-Verified at 62.9%. Benchmark milestone showing rapid improvement from prior Claude generations. | Anthropic | 2025-09 | 62.9 | — | — | — |
Browser-only
2 agents
| 1 | Chrome-integrated agent powered by Gemini 2.0 → 3.x. Runs 10 concurrent VM tasks. Available via Google AI Ultra subscription. Strongest on ScreenSpot. | Google DeepMind | 2024-12 | — | — | — | — |
| 2 | Microsoft's official MCP server wrapping Playwright. Exposes Chromium/Firefox/WebKit as MCP tools so any MCP client (Claude Code, Cursor, Codex CLI) can drive a | Microsoft | 2025-03 | — | — | — | — |
Sandboxed VM / container
1 agent
| 1 | AI full-stack app builder (formerly GPT Engineer). Viral 2025. Supabase integration out of the box. Built for non-technical founders. | Lovable | 2024-11 | — | — | — | — |
Coding-focused
2 agents
| 1 | SWE-Agent (OSS): Open-source research agent (NeurIPS 2024). Mini-SWE-Agent scores >74% on SWE-bench Verified in 100 lines of Python, no tool-calling needed. SOTA open-source sca | Princeton + Stanford | 2024-04 | — | — | — | — |
| 2 | Aider (OSS): Terminal-first AI pair programmer. Git-integrated. Batch editing. BYO-LLM. Popular in the local-LLM community. | Aider (OSS) | 2023-05 | — | — | — | — |
Scores verified against OSWorld-Verified, BrowseComp, Steel.dev leaderboard, SWE-Bench, TheAgentCompany, and maker publications. Dash = not published.
Who breaks 90% on OSWorld-Verified first?
The 72.4% human baseline already fell. The next round number, 90% on the verified split, would put agents 17.6pp above expert humans.
What each benchmark actually measures
With a 12-month SOTA trend so you can see if the curve is still climbing or has flattened.
BrowseComp
1,266 hard browsing problems. Multi-hop, deep web research. Grounded factual answers, no LLM judge.
Terminal-Bench 2.0
Held-out CLI tasks in real shells. Contamination-resistant successor to Terminal-Bench 1.
WebVoyager
643 real-world web tasks across Amazon, Booking, dictionaries. GPT-4V judge (criticized).
SWE-Bench Verified
500 verified GitHub issues from 12 popular Python repos. Patches must pass repo tests.
SWE-Bench Pro
Held-out, multi-language SWE-Bench successor. Contamination-resistant.
GDPval
44 occupations, blinded expert pairwise comparison of agent vs human deliverables.
How the benchmarks evolved — 2023 → 2026
The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.
- 2023: WebArena + Mind2Web
  First generation. DOM-only. Fully gamed by 2025.
- Apr 2024: OSWorld (v1)
  XLANG Lab's first real-desktop VM benchmark. 369 tasks.
- Jun 2024: WebVoyager + GAIA
  Web agents + general reasoning. GPT-4V as judge (later criticized).
- Dec 2024: TheAgentCompany
  CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.
- Apr 2025: BrowseComp
  OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.
- Jul 2025: OSWorld-Verified
  XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.
- Sep 2025: GDPval
  OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.
- Oct 2025: SWE-Bench Pro + Terminal-Bench 2.0
  Held-out, contamination-resistant successors to the gamed originals.
- 2026: Core triad consolidates
  OSWorld-Verified + BrowseComp + Terminal-Bench 2.0 = weighted agentic score.
Every benchmark that matters — what each one actually measures
A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.
The 2026 core triad
BenchLM and agentic leaderboards now weight these three as the canonical agentic score.
BrowseComp
OpenAI's browsing-depth benchmark with 1,266 hard research-style problems. Tests whether an agent can find correct, factually-grounded answers across the open web. Released April 2025.
Terminal-Bench 2.0
Autonomous multi-step shell tasks. Current SOTA: Claude Mythos at 92.1%.
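The page says leaderboards fold the triad into one weighted agentic score but does not publish the weights. As a minimal sketch, assuming equal weights and made-up scores (both are placeholders, not values from BenchLM or any leaderboard), the combination could look like this:

```python
def weighted_agentic_score(scores, weights):
    """Weighted mean of benchmark scores (in percent), normalised by total weight."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical equal weights; real leaderboards may weight the triad differently.
TRIAD_WEIGHTS = {
    "OSWorld-Verified": 1.0,
    "BrowseComp": 1.0,
    "Terminal-Bench 2.0": 1.0,
}

# Illustrative scores for one imaginary agent, not taken from any leaderboard.
example_scores = {
    "OSWorld-Verified": 70.0,
    "BrowseComp": 60.0,
    "Terminal-Bench 2.0": 80.0,
}

print(weighted_agentic_score(example_scores, TRIAD_WEIGHTS))
```

Normalising by the total weight keeps the result on the same 0 to 100 scale as the inputs, so changing the weights never changes the units.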
Enterprise workflow
Realistic knowledge-worker tasks inside simulated companies, ServiceNow, professional domains.
GDPval
OpenAI's economic-impact benchmark. Professional work tasks across 44 occupations. Main metric = blinded expert pairwise judgment of deliverables (70.8% inter-rater human agreement). Tests whether agents can do actual white-collar work.
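A blinded pairwise metric reduces to a win rate over grader verdicts. The sketch below assumes each comparison is labelled `"agent"`, `"human"`, or `"tie"`, and that a tie counts as half a win; both the labels and the tie handling are assumptions for illustration, not GDPval's published scoring rules.

```python
from collections import Counter


def win_rate(judgments, tie_weight=0.5):
    """Share of comparisons the agent wins; each tie counts as tie_weight of a win."""
    counts = Counter(judgments)
    unknown = set(counts) - {"agent", "human", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    return (counts["agent"] + tie_weight * counts["tie"]) / total


# Made-up verdicts from eight blinded comparisons.
judgments = ["agent", "human", "human", "tie", "human", "agent", "human", "human"]
print(win_rate(judgments))
```

A win rate under 0.5 corresponds to the agent losing the majority of comparisons.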
Browser-first
Web-scoped tasks. Online-Mind2Web and REAL use programmatic checkers instead of LLM judges.
WebVoyager
Standard browser-agent benchmark. 643 tasks across 15 websites (Google, Amazon, GitHub, Reddit, Wikipedia). Form filling, navigation, search, shopping. Surfer 2 (H Company) holds SOTA at 97.1%.
WebArena
First-generation web-agent benchmark. Standalone websites in a sandboxed environment (e-commerce, social, classifieds, software). Largely superseded by Online-Mind2Web for live testing.
Specialized
Coding, mobile, GUI-grounding, and reasoning-heavy evaluations.
SWE-Bench Verified
OpenAI-verified subset of SWE-Bench (500 manually-verified Python issues). Originally the gold standard for coding-agent evaluation, now partially gamed — succeeded by SWE-Bench Pro.
SWE-Bench Pro
Contamination-resistant successor to SWE-Bench Verified. 731 held-out real-world GitHub issues across multiple programming languages. Private split prevents test-set leakage.
Mind2Web
Original Mind2Web web-agent benchmark. 2,000+ tasks across 137 websites. Largely superseded by Online-Mind2Web (live evaluation) and Mind2Web 2 (deep research).
How they actually work
The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
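To make the modular stack concrete, here is a minimal sketch in which each module is a plain callable wired together by one loop. Every name in it (`ModularAgent`, `Memory`, the toy lambdas, the `TARGETS` table) is hypothetical and chosen for illustration; it is not Agent S2's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Append-only log of (step, outcome) pairs shared across modules."""
    events: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, step: str, outcome: str) -> None:
        self.events.append((step, outcome))


@dataclass
class ModularAgent:
    planner: Callable[[str, Memory], List[str]]   # goal -> ordered sub-steps
    grounder: Callable[[str], Tuple[int, int]]    # sub-step -> screen coordinates
    executor: Callable[[Tuple[int, int]], str]    # coordinates -> observation
    verifier: Callable[[str, str], bool]          # (sub-step, observation) -> success?
    memory: Memory = field(default_factory=Memory)

    def run(self, goal: str) -> bool:
        for step in self.planner(goal, self.memory):
            point = self.grounder(step)
            observation = self.executor(point)
            ok = self.verifier(step, observation)
            self.memory.record(step, "ok" if ok else "failed")
            if not ok:
                return False  # a fuller system would replan here instead of stopping
        return True


# Toy stand-ins so the sketch runs without any model or desktop.
TARGETS = {"open menu": (10, 10), "click save": (40, 25)}

agent = ModularAgent(
    planner=lambda goal, memory: ["open menu", "click save"],
    grounder=lambda step: TARGETS[step],
    executor=lambda point: f"clicked at {point}",
    verifier=lambda step, observation: observation.startswith("clicked"),
)
print(agent.run("save the document"), agent.memory.events)
```

Because each module sits behind its own callable, a stronger grounder or verifier can be swapped in without touching the planner.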
Perception
Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.
Action grounding
Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
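One way to see the difference between the two styles: a JSON tool call triggers exactly one action per model turn, while a code action can loop over and compose tools in a single turn. The sketch below uses a hypothetical `click` tool and a trimmed `exec` namespace purely for illustration; it is not the CodeAct reference implementation, and `exec` is not a security boundary.

```python
import json

clicks = []  # records every click so the two styles can be compared


def click(x: int, y: int) -> str:
    """Hypothetical tool: pretend to click and log the coordinates."""
    clicks.append((x, y))
    return f"clicked ({x}, {y})"


TOOLS = {"click": click}


def run_json_action(message: str) -> str:
    """JSON tool-calling: one model message triggers exactly one tool call."""
    call = json.loads(message)
    return TOOLS[call["tool"]](**call["args"])


def run_code_action(source: str) -> None:
    """Code-as-Action: the model emits Python that may loop over and compose tools.

    exec() with a trimmed namespace is NOT a sandbox; real agents run
    generated code inside a container or microVM.
    """
    exec(source, {"__builtins__": {"range": range}, **TOOLS})


# Three clicks via JSON need three separate model turns...
for y in (100, 140, 180):
    run_json_action(json.dumps({"tool": "click", "args": {"x": 20, "y": y}}))

# ...while a single code action expresses the same loop in one turn.
run_code_action("for y in range(100, 200, 40):\n    click(20, y)")
print(clicks)
```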
Planning
ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
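The DAG idea can be sketched with the standard library: tasks start as soon as their prerequisites finish, so independent tool calls overlap instead of running one after another. This is a simplified illustration of dependency-aware scheduling, not LLMCompiler's implementation; `run_dag`, `fake_tool`, and the `PLAN` contents are hypothetical, and the sketch makes no claim about reproducing the 3.6× figure.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_dag(deps, run_task):
    """Run tasks as soon as their dependencies finish; independent tasks overlap.

    deps maps a task name to the set of task names it depends on.
    run_task is an async callable: (name, results_so_far) -> result.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    results, running = {}, {}
    while sorter.is_active():
        for name in sorter.get_ready():
            running[asyncio.create_task(run_task(name, results))] = name
        done, _ = await asyncio.wait(running, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name = running.pop(task)
            results[name] = task.result()
            sorter.done(name)  # unlocks any task that was waiting on this one
    return results


async def fake_tool(name, results):
    await asyncio.sleep(0.05)  # stands in for a tool or model call
    return f"{name} done"


# Hypothetical plan: two independent lookups feed one summary step.
PLAN = {
    "search_flights": set(),
    "search_hotels": set(),
    "summarise": {"search_flights", "search_hotels"},
}

results = asyncio.run(run_dag(PLAN, fake_tool))
print(results)
```

Here the two searches run concurrently and `summarise` starts only after both complete, which is where the speedup over a strictly sequential ReAct loop comes from.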
Sandboxing
Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.
Error recovery
Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
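Screenshot diffing and retry-with-variation combine naturally: capture the screen, act, capture again, and move to the next variant if nothing changed. The sketch below treats a screenshot as raw bytes and uses a 1% change threshold; the helper names, the threshold, and the fake screen are all assumptions for illustration, not Stagehand's mechanism.

```python
from typing import Callable, Optional, Sequence


def changed_fraction(before: bytes, after: bytes) -> float:
    """Fraction of byte positions that differ between two same-sized raw frames."""
    if len(before) != len(after):
        return 1.0  # a resize counts as a full change
    if not before:
        return 0.0
    return sum(a != b for a, b in zip(before, after)) / len(before)


def act_with_retry(
    capture: Callable[[], bytes],
    variants: Sequence[Callable[[], None]],
    threshold: float = 0.01,
) -> Optional[int]:
    """Try each action variant until the screen visibly changes.

    Returns the index of the variant that worked, or None if none did.
    """
    for index, action in enumerate(variants):
        before = capture()
        action()
        if changed_fraction(before, capture()) >= threshold:
            return index
    return None


# Fake 100-byte "screen" so the sketch runs without a display.
screen = bytearray(100)


def capture() -> bytes:
    return bytes(screen)


def click_stale_coords() -> None:
    pass  # misses the target: nothing on screen changes


def click_by_label() -> None:
    screen[:10] = b"\xff" * 10  # the UI reacts: 10% of the frame changes


print(act_with_retry(capture, [click_stale_coords, click_by_label]))
```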
UI grounding
Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
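Grounding benchmarks are commonly scored by checking whether the predicted click lands inside the target element's bounding box. The sketch below assumes that scoring rule and pixel coordinates in left, top, right, bottom order; the sample boxes and predictions are invented.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # left, top, right, bottom in pixels


def hit(point: Point, box: Box) -> bool:
    """True when the predicted click lands inside the target element's box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Share of predicted clicks that fall inside their target boxes."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("no samples supplied")
    return sum(hit(p, b) for p, b in zip(predictions, targets)) / len(targets)


# Invented data: the first click lands on its button, the second misses to the right.
targets = [(100, 200, 180, 240), (10, 10, 50, 30)]
predictions = [(140, 220), (60, 20)]
print(grounding_accuracy(predictions, targets))
```

Under this rule a near miss scores the same as a wild miss, which is part of why small, dense UI targets are so punishing.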
Still unsolved — the safety ceiling
- Prompt injection: OpenAI admitted December 2025 it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
- Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
- Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
- Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
- Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
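The Economics and Speed bullets above reduce to simple arithmetic. The sketch below derives per-extraction cost from the quoted daily figures and wall-clock time from the quoted per-action latency; the 40-action task length and the 30-day month are assumptions for illustration.

```python
def per_unit_cost(daily_cost_usd: float, units_per_day: int) -> float:
    """Average cost of one unit of work given a daily bill."""
    return daily_cost_usd / units_per_day


def task_seconds(actions: int, seconds_per_action: float) -> float:
    """Wall-clock time for a task when actions run strictly one after another."""
    return actions * seconds_per_action


EXTRACTIONS_PER_DAY = 10_000      # from the Economics bullet
DAILY_FEES_USD = (50.0, 200.0)    # low and high ends from the Economics bullet
SECONDS_PER_ACTION = (2.0, 5.0)   # low and high ends from the Speed bullet
ACTIONS_PER_TASK = 40             # assumption: a mid-sized multi-step task
DAYS_PER_MONTH = 30               # assumption: flat 30-day month

low, high = (per_unit_cost(fee, EXTRACTIONS_PER_DAY) for fee in DAILY_FEES_USD)
fast, slow = (task_seconds(ACTIONS_PER_TASK, s) for s in SECONDS_PER_ACTION)
monthly = tuple(fee * DAYS_PER_MONTH for fee in DAILY_FEES_USD)

print(f"cost per extraction: ${low:.3f} to ${high:.3f}")
print(f"monthly LLM fees: ${monthly[0]:,.0f} to ${monthly[1]:,.0f}")
print(f"{ACTIONS_PER_TASK}-action task: {fast:.0f} to {slow:.0f} seconds")
```

Fractions of a cent per extraction look negligible in isolation; the monthly total is what competes with a deterministic Playwright script that costs nothing per run.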
Frequently asked
Q1. What is a Computer Use agent?
A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).
Q2. What is the current OSWorld SOTA in 2026?
As of April 2026, the OSWorld-Verified leaderboard is led by Holo3-35B-A3B from H Company at 80.4%. Holo3 was the first model to cleanly beat the 72.4% human-expert baseline on the verified split. Kimi K2.6 (Moonshot AI) is second at 73.1% as a general-purpose model, and Claude Sonnet 4.6 is third at 72.1% — effectively tied with the human baseline.
Q3. Is OSWorld still the right benchmark, or is it outdated?
OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and cleaned evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.0) that weighted agentic leaderboards use. That said, it's only one dimension — enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.
Q4. How is OSWorld-Verified different from the original OSWorld?
OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.
Q5. What other benchmarks exist besides OSWorld?
The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.0 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.
Q6. Can any agent beat a human on these benchmarks?
On OSWorld-Verified: yes, two models now cleanly beat the 72.4% human baseline (Holo3-35B-A3B at 80.4%, Kimi K2.6 at 73.1%). On WebVoyager: yes, Surfer 2 at 97.1% pass@1 is above expected human accuracy. On BrowseComp: yes, Claude Mythos Preview scores 86.9%, while humans with internet access score ~80%. On SWE-Bench Verified: top models pass 87%+ of real GitHub issues. On AndroidWorld: no, models trail the ~80% human baseline; the current SOTA is UI-TARS-2 at 75.8%. On GDPval: no, agents lose blind expert pairwise comparisons the majority of the time.
Q7. What's the safety picture in 2026?
Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.
Q8. Which agents are open-source?
The strongest open-source options in 2026: OpenHands (All Hands AI, sandbox+coding), Browser Use (Python library driving Chromium), Magnetic-One (Microsoft Research multi-agent), UI-TARS-2 (ByteDance, 53.1% OSWorld-Verified), Kimi K2.5/K2.6 (Moonshot, 63.3% / 73.1% OSWorld-Verified), Holo3 predecessor models (H Company research releases), GUI-Owl-1.5 32B (Alibaba, 55.4% OSWorld-Verified), Playwright MCP (Microsoft), Chrome DevTools MCP (Google). For the full list, toggle 'Open-source only' on the leaderboard above.
Editor's take — April 24, 2026
2026 is the year computer use stopped being a demo and started being a line item. Winners today: H Company on OSWorld-Verified (Holo3-35B-A3B, 80.4%), Anthropic on Terminal-Bench 2.0, OpenAI on browsing (BrowseComp + Mind2Web 2), H Company's Surfer 2 on WebVoyager (97.1%), OpenHands for open-source coding, Claude Code for terminal autonomy, Microsoft Copilot Studio for enterprise distribution. Yet to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source stack has caught and in some cases surpassed proprietary offerings: Kimi K2.6 (73.1% on OSWorld-Verified) is the first open-source model to beat the human baseline.
Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed April 24, 2026.