Autonomous agents that control computers — OSWorld, BrowseComp & Terminal-Bench.
Current OSWorld-Verified SOTA: Claude Opus 4.8 from Anthropic at 83.4% (June 2026), ahead of Claude Opus 4.7 at 82.8% and comfortably past the 72.4% human-expert baseline. Strongest open-source: Qwen3-VL-235B (Alibaba) at 66.7% on OSWorld-Verified. Best coding agent: Claude Opus 4.8 (SWE-bench Pro 69.2%, SWE-bench Verified 88.6%). Best browser agent: Surfer 2 (WebVoyager 97.1%, H Company). This page tracks 16 agents across 8 verified benchmarks.
Agents that can actually drive your computer.
Less than two years ago no AI could operate a desktop better than a drunk intern: OSWorld scores sat under 15%. Today Claude Opus 4.8 leads OSWorld-Verified at 83.4%, and about ten models now clear the 72.4% human-expert baseline, with the open-source pack only a few points behind. The big story isn't one breakthrough; it's everyone shipping at once.
This page tracks 16 agents across 4 architectural categories, scored on 8 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.
The 4 types of computer use
Screen-level OS control
5 agents

| # | Agent | Maker | Released | OSWorld-Verified | BrowseComp | Terminal-Bench | SWE-bench Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Prior Anthropic flagship; OSWorld-Verified 82.8% after the zoom-tool fix and a 16K to 128K max-tokens-per-turn harness update. | Anthropic | 2026-03 | 82.8 | — | 69.7 | 64.3 |
| 2 | Anthropic research preview; strong on deep browsing (BrowseComp 86.9%) and OSWorld-Verified 79.6%. | Anthropic | 2026-04 | 79.6 | 86.9 | — | — |
| 3 | Anthropic's fast mid-tier model; sits right on the human OSWorld-Verified baseline at 72.1%. | Anthropic | 2026-02 | 72.1 | — | — | — |
| 4 | OpenAI's 2026 frontier model; OSWorld-Verified 78.7% and the Terminal-Bench 2.1 leader via Codex CLI (83.4%). | OpenAI | 2026-05 | 78.7 | — | — | 58.6 |
| 5 | Sept 2025 Anthropic model; OSWorld-Verified 62.9%, a marker of how fast the frontier moved in 2026. | Anthropic | 2025-09 | 62.9 | — | — | — |
Browser-only
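2 agents

| # | Agent | Maker | Released | OSWorld-Verified | BrowseComp | Terminal-Bench | SWE-bench Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | OpenAI's original Computer-Using Agent (CUA). WebVoyager 87%, WebArena 58.1%. | OpenAI | 2025-01 | — | — | — | — |
| 2 | Google DeepMind's research browser agent; multi-tab task automation built on Gemini. | Google DeepMind | 2024-12 | — | — | — | — |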
Coding-focused
9 agents

| # | Agent | Maker | Released | OSWorld-Verified | BrowseComp | Terminal-Bench | SWE-bench Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Anthropic's terminal-native coding agent. With Opus 4.8 it scores Terminal-Bench 2.1 78.9%, SWE-bench Pro 69.2%, SWE-bench Verified 88.6%. | Anthropic | 2025-02 | — | — | 78.9 | 69.2 |
| 2 | OpenAI's terminal coding agent. With GPT-5.5 it leads Terminal-Bench 2.1 at 83.4%. | OpenAI | 2025-04 | — | — | 83.4 | — |
| 3 | Kimi K2.6 (OSS): Moonshot's open agentic model; SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, Terminal-Bench 2.0 66.7%. Sustains 4,000+ tool calls over 13-hour sessions. | Moonshot AI | 2026-04 | 73.1 | — | 66.7 | 58.6 |
| 4 | SWE-Agent (OSS): The academic agent that defined the Agent-Computer Interface for fixing GitHub issues; the open baseline behind SWE-bench. | Princeton + Stanford | 2024-04 | — | — | — | — |
| 5 | Gemini CLI (OSS): Google's open-source terminal agent. With Gemini 3.1 Pro it scores Terminal-Bench 2.1 70.7%. | Google | 2025-06 | — | — | 70.7 | — |
| 6 | Kimi K2.5 (OSS): Moonshot's January 2026 open visual agentic model; OSWorld-Verified 63.3%. | Moonshot AI | 2026-01 | 63.3 | — | — | — |
| 7 | GLM-5.1 (OSS): Z.ai's 754B-param MoE; SWE-bench Pro 58.4% and a 1,530 Code Arena Elo (3rd globally on agentic web development). | Z.ai | 2026-04 | — | — | — | 58.4 |
| 8 | OpenCode (OSS): Free, model-agnostic open-source terminal coding agent; a community alternative to Claude Code and Codex CLI. | OpenCode | 2025-06 | — | — | — | — |
| 9 | Aider (OSS): Popular open-source pair-programming agent in the terminal; edits across a git repo with any frontier model. | Aider | 2023-05 | — | — | — | — |
Scores verified against OSWorld-Verified, BrowseComp, Steel.dev leaderboard, SWE-Bench, TheAgentCompany, and maker publications. Dash = not published.
Who breaks 90% on OSWorld-Verified first?
The 72.4% human baseline already fell. The next round number, 90% on the verified split, would put agents 17.6pp above expert humans.
What each benchmark actually measures
With a 12-month SOTA trend so you can see if the curve is still climbing or has flattened.
BrowseComp
1,266 hard browsing problems. Multi-hop, deep web research. Grounded factual answers, no LLM judge.
Terminal-Bench 2.1
Held-out CLI tasks in real shells. Contamination-resistant successor to Terminal-Bench 1.
WebVoyager
643 real-world web tasks across Amazon, Booking, dictionaries. GPT-4V judge (criticized).
SWE-Bench Verified
500 verified GitHub issues from 12 popular Python repos. Patches must pass repo tests.
SWE-Bench Pro
Held-out, multi-language SWE-Bench successor. Contamination-resistant.
GDPval
44 occupations, blinded expert pairwise comparison of agent vs human deliverables.
How the benchmarks evolved — 2023 → 2026
The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.
- 2023: WebArena + Mind2Web
  First generation. DOM-only. Fully gamed by 2025.
- Apr 2024: OSWorld (v1)
  XLANG Lab's first real-desktop VM benchmark. 369 tasks.
- Jun 2024: WebVoyager + GAIA
  Web agents + general reasoning. GPT-4V as judge (later criticized).
- Dec 2024: TheAgentCompany
  CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.
- Apr 2025: BrowseComp
  OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.
- Jul 2025: OSWorld-Verified
  XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.
- Sep 2025: GDPval
  OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.
- Oct 2025: SWE-Bench Pro + Terminal-Bench 2.0
  Held-out, contamination-resistant successors to the gamed originals.
- Feb 2026: Gemini 3.1 Pro + open GUI models
  Google jumps on browsing (BrowseComp 85.9%); GUI-Owl-1.5 closes the open-source gap.
- May 2026: Claude Opus 4.8 takes the crown
  OSWorld-Verified 83.4%, clearing the 72.4% human baseline by 11 points; SWE-bench Pro 69.2%.
- 2026: Core triad consolidates
  OSWorld-Verified + BrowseComp + Terminal-Bench 2.1 = weighted agentic score.
Every benchmark that matters — what each one actually measures
A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.
The 2026 core triad
BenchLM and agentic leaderboards now weight these three as the canonical agentic score.
BrowseComp
OpenAI's 1,266 hard browsing problems that reward research depth and factual grounding rather than shallow navigation.
Terminal-Bench 2.1
Held-out, contamination-resistant CLI tasks driven end-to-end in a real terminal. Version 2.1 is the 2026 standard for terminal autonomy.
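One way to read "weighted agentic score" is as a weighted mean over the three triad benchmarks. The sketch below is only an illustration: the equal weights, the key names, and the choice to renormalise when a score is unpublished are all assumptions, since the actual weighting is not stated on this page. The example uses the research-preview figures quoted above (OSWorld-Verified 79.6%, BrowseComp 86.9%, no Terminal-Bench score).

```python
from typing import Optional

# Hypothetical equal weights; the real leaderboard weighting is not given in the source.
WEIGHTS = {"osworld_verified": 1 / 3, "browsecomp": 1 / 3, "terminal_bench": 1 / 3}


def weighted_agentic_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the triad benchmarks that have a published score.

    Unpublished scores (None) are skipped and the remaining weights are
    renormalised. That policy is an assumption of this sketch.
    """
    available = {k: v for k, v in scores.items() if k in WEIGHTS and v is not None}
    if not available:
        return None
    total_weight = sum(WEIGHTS[k] for k in available)
    return sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight


# Figures quoted on this page for the Anthropic research preview.
example = weighted_agentic_score(
    {"osworld_verified": 79.6, "browsecomp": 86.9, "terminal_bench": None}
)
print(example)
```

Renormalising flatters agents that skip their weakest benchmark, so a stricter variant could return `None` whenever any of the three scores is missing.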
Enterprise workflow
Realistic knowledge-worker tasks inside simulated companies, ServiceNow, professional domains.
GDPval
OpenAI's economic-impact eval across 44 occupations with blinded expert judging of real deliverables. Agents still lose most pairwise comparisons to human experts.
Browser-first
Web-scoped tasks. Online-Mind2Web and REAL use programmatic checkers instead of LLM judges.
WebVoyager
Live-website web tasks across 15 real sites. Largely saturated in 2026: top agents exceed expected human accuracy.
WebArena
Self-hosted replica sites (shopping, forum, GitLab, CMS). First-generation but still cited; programmatically checked.
Specialized
Coding, mobile, GUI-grounding, and reasoning-heavy evaluations.
SWE-Bench Verified
OpenAI-verified 500-issue subset of SWE-Bench. Approaching saturation in 2026: most frontier models clear 80%+.
SWE-Bench Pro
Harder, contamination-resistant successor to SWE-Bench Verified: real GitHub issues with held-out tests. Where coding headroom remains.
Mind2Web
Original Mind2Web web-agent benchmark. 2,000+ tasks across 137 websites. Largely superseded by Online-Mind2Web (live evaluation) and Mind2Web 2 (deep research).
How they actually work
The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
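The modular stack can be pictured as five small components wired into one loop. The following is a minimal sketch under stated assumptions, not Agent S2's actual design: the function names, the `Memory` class, and the toy stand-ins are all hypothetical, and a real system would put a model behind the planner, grounder, and verifier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Append-only log of (step, action, ok) records."""
    events: list[tuple[str, str, bool]] = field(default_factory=list)


def run_task(
    goal: str,
    planner: Callable[[str], list[str]],
    grounder: Callable[[str], str],
    executor: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    memory: Memory,
) -> bool:
    """Plan once, then ground, execute, and verify each step in order."""
    for step in planner(goal):
        action = grounder(step)            # step text -> concrete UI action
        observation = executor(action)     # perform the action, report what happened
        ok = verifier(step, observation)   # did the step achieve its intent?
        memory.events.append((step, action, ok))
        if not ok:
            return False                   # a fuller agent would replan here
    return True


# Toy stand-ins so the sketch runs without a model or a desktop.
def toy_planner(goal: str) -> list[str]:
    return ["open settings", "enable dark mode"]


def toy_grounder(step: str) -> str:
    return f"click:{step.split()[-1]}"


def toy_executor(action: str) -> str:
    return f"done {action}"


def toy_verifier(step: str, observation: str) -> bool:
    return observation.startswith("done")


memory = Memory()
succeeded = run_task(
    "turn on dark mode", toy_planner, toy_grounder, toy_executor, toy_verifier, memory
)
```

Keeping each role behind its own callable is what lets one module be swapped (for example, a better grounder) without retraining or rewriting the rest.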
Perception
Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.
Action grounding
Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
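To make the difference between JSON tool-calling and Code-as-Action concrete, here is a small sketch with one hypothetical `click` tool. It only illustrates the shape of the two interfaces: the tool, the message formats, and the runner functions are assumptions, and the emptied `__builtins__` is not a real sandbox.

```python
import json

clicked: list[str] = []


def click(label: str) -> None:
    """Hypothetical tool: record a click on the element with this label."""
    clicked.append(label)


TOOLS = {"click": click}


def run_json_call(message: str) -> None:
    """JSON tool-calling: one message names exactly one tool and its arguments."""
    call = json.loads(message)
    TOOLS[call["tool"]](**call["args"])


def run_code_action(source: str) -> None:
    """Code-as-Action: the message is a program that may call tools freely.

    exec() on model output is unsafe outside an isolated sandbox; this only
    shows the interface, not a safe way to run it.
    """
    exec(source, {"__builtins__": {}, **TOOLS})


# Three clicks take three separate JSON messages ...
for label in ("File", "Export", "PDF"):
    run_json_call(json.dumps({"tool": "click", "args": {"label": label}}))

# ... but one code action, because the loop lives inside the action itself.
run_code_action("for label in ('File', 'Export', 'PDF'):\n    click(label)")
```

The code form saves model round-trips on multi-step work, which is one plausible reason it does better on complex tasks; the cost is that the executor must be isolated far more carefully.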
Planning
ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
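The parallel speedup from DAG-style planning comes from running independent tool calls at the same time and joining them only where a later step needs their results. The sketch below is a generic dependency-graph scheduler built on `asyncio`, not LLMCompiler's implementation; the step names and the travel example are hypothetical, and it assumes the graph has no cycles.

```python
import asyncio
from typing import Any, Awaitable, Callable

Step = Callable[[dict[str, Any]], Awaitable[Any]]


async def run_dag(steps: dict[str, Step], deps: dict[str, list[str]]) -> dict[str, Any]:
    """Run every step as soon as its dependencies finish (assumes no cycles)."""
    running: dict[str, asyncio.Task] = {}

    async def run(name: str) -> Any:
        # Several dependents may await the same Task; each receives its result.
        inputs = {dep: await running[dep] for dep in deps.get(name, [])}
        return await steps[name](inputs)

    # create_task only schedules work; no step body runs until the await below,
    # so `running` is fully populated before any step looks up a dependency.
    for name in steps:
        running[name] = asyncio.create_task(run(name))
    values = await asyncio.gather(*running.values())
    return dict(zip(running, values))


async def search_flights(_: dict[str, Any]) -> str:
    await asyncio.sleep(0.01)  # stands in for a slow tool call
    return "flight"


async def search_hotels(_: dict[str, Any]) -> str:
    await asyncio.sleep(0.01)
    return "hotel"


async def summarize(inputs: dict[str, Any]) -> str:
    return f"{inputs['flights']} + {inputs['hotels']}"


results = asyncio.run(run_dag(
    {"flights": search_flights, "hotels": search_hotels, "summary": summarize},
    {"summary": ["flights", "hotels"]},
))
```

Here the two searches overlap and only the summary waits, so the wall-clock time is roughly one search plus the summary rather than the sum of all three.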
Sandboxing
Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.
Error recovery
Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
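Screenshot diffing and retry-with-variation combine naturally: try an action, check whether the screen changed, and fall back to a different way of performing the same action if it did not. This is a toy sketch under stated assumptions; the strategy names and the byte-string "screen" are hypothetical, and real agents compare pixels with a tolerance rather than exact bytes.

```python
from typing import Callable, Optional, Sequence


def screens_differ(before: bytes, after: bytes) -> bool:
    """Crude screenshot diff: treat any byte change as a visible effect."""
    return before != after


def retry_with_variation(
    attempt: Callable[[str], bool],
    strategies: Sequence[str],
) -> Optional[str]:
    """Try each strategy in turn; return the first that succeeds, else None."""
    for strategy in strategies:
        if attempt(strategy):
            return strategy
    return None


# Toy UI: the layout has shifted, so only the label-based click reaches the button.
screen = {"pixels": b"login-page"}


def toy_attempt(strategy: str) -> bool:
    before = screen["pixels"]
    if strategy == "click_by_label":
        screen["pixels"] = b"dashboard"
    return screens_differ(before, screen["pixels"])


winner = retry_with_variation(
    toy_attempt, ["click_by_coordinates", "click_by_selector", "click_by_label"]
)
```

Varying the strategy between attempts matters because repeating an identical failed click usually fails again for the same reason.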
UI grounding
Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
Still unsolved — the safety ceiling
- Prompt injection: OpenAI admitted December 2025 it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
- Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
- Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
- Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
- Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
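The economics and speed bullets above reduce to simple arithmetic. The sketch below works it through using the figures quoted in those bullets; the 30-step task length and the 30-day month are hypothetical values chosen only for illustration.

```python
EXTRACTIONS_PER_DAY = 10_000
DAILY_COST_USD = (50.0, 200.0)    # low and high daily LLM fees, from the text
SECONDS_PER_ACTION = (2.0, 5.0)   # low and high per-action latency, from the text
STEPS_PER_TASK = 30               # hypothetical task length, not from the source
DAYS_PER_MONTH = 30               # hypothetical billing month

# Cost of a single extraction at the low and high end.
cost_per_extraction = tuple(c / EXTRACTIONS_PER_DAY for c in DAILY_COST_USD)

# The same workload over a month.
monthly_cost = tuple(c * DAYS_PER_MONTH for c in DAILY_COST_USD)

# Wall-clock time for one sequential task of STEPS_PER_TASK actions.
task_latency_s = tuple(s * STEPS_PER_TASK for s in SECONDS_PER_ACTION)

print(f"per extraction: ${cost_per_extraction[0]:.3f} to ${cost_per_extraction[1]:.3f}")
print(f"per month:      ${monthly_cost[0]:,.0f} to ${monthly_cost[1]:,.0f}")
print(f"30-step task:   {task_latency_s[0]:.0f}s to {task_latency_s[1]:.0f}s")
```

Half a cent to two cents per extraction looks negligible in isolation; it is the multiplication by daily volume and by steps per task that produces the compounding the bullets describe.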
Frequently asked
Q1. What is a Computer Use agent?
A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).
Q2. What is the current OSWorld SOTA in 2026?
As of June 2026, the OSWorld-Verified leaderboard is led by Claude Opus 4.8 (Anthropic) at 83.4%, ahead of Claude Opus 4.7 (82.8%) and H Company’s Holo3-35B-A3B (82.6%). Roughly ten models now clear the 72.4% human-expert baseline, including GPT-5.4/5.5, Gemini 3.1 Pro and Claude Mythos Preview. The strongest open-source model is Qwen3-VL-235B at 66.7%.
Q3. Is OSWorld still the right benchmark, or is it outdated?
OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and improved evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.1) that weighted agentic leaderboards use. That said, it's only one dimension: enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.
Q4. How is OSWorld-Verified different from the original OSWorld?
OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.
Q5. What other benchmarks exist besides OSWorld?
The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.1 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.
Q6. Can any agent beat a human on these benchmarks?
On OSWorld-Verified: yes — about ten models now clear the 72.4% human baseline, led by Claude Opus 4.8 at 83.4%. On WebVoyager: yes, Surfer 2 at 97.1% is above expected human accuracy. On BrowseComp: Claude Mythos Preview (86.9%) and Gemini 3.1 Pro (85.9%) are near the ~80% human-with-internet level. On SWE-Bench Verified: top models pass 88%+ of real GitHub issues, and Claude Fable 5 reaches 95%. On AndroidWorld: Surfer 2 leads at 87.1%, now above the ~80% human baseline. On GDPval: agents still lose blind expert pairwise comparisons the majority of the time.
Q7. What's the safety picture in 2026?
Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.
Q8. Which agents are open-source?
The strongest open-source options in 2026: Qwen3-VL-235B (Alibaba, 66.7% OSWorld-Verified — the open SOTA), Kimi K2.5 (Moonshot, 63.3% OSWorld-Verified) and Kimi K2.6 (open coding/agent leader: SWE-bench Pro 58.6%, Terminal-Bench 2.0 66.7%), GUI-Owl-1.5 (Alibaba, 52.9% OSWorld-Verified at 8B, ScreenSpot-Pro 80.3%), UI-TARS-2 (ByteDance, ~53% OSWorld-Verified, Online-Mind2Web 88.2%), GLM-5.1 (Z.ai, SWE-bench Pro 58.4%), plus the agent frameworks OpenHands, Browser Use (WebVoyager 89%), Magentic-One and Playwright MCP. Toggle 'Open-source only' on the leaderboard above for the full list.
Editor's take — June 19, 2026
2026 is the year computer use stopped being a demo and started being a line item. Winners today: Anthropic on OSWorld-Verified (Claude Opus 4.8, 83.4%) and SWE-bench Pro (69.2%), OpenAI on the terminal (Codex CLI + GPT-5.5 lead Terminal-Bench 2.1 at 83.4%), Gemini 3.1 Pro on agentic browsing (BrowseComp 85.9%), H Company's Surfer 2 on WebVoyager (97.1%), and OpenHands for open-source coding. Still to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source pack (Qwen3-VL-235B, Kimi K2.6, GUI-Owl-1.5) now trails the proprietary frontier on OSWorld-Verified by only a few points.
Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed June 19, 2026.