Agents Lab

Autonomous agents that control computers — OSWorld, BrowseComp & Terminal-Bench.

Quick Answer · Updated June 19, 2026

Current OSWorld-Verified SOTA: Claude Opus 4.7 from Anthropic at 82.8% (June 2026) — comfortably past the 72.4% human-expert baseline. Strongest open-source: Qwen3-VL-235B (Alibaba) at 66.7% on OSWorld-Verified. Best coding agent: Claude Opus 4.8 (SWE-bench Pro 69.2%, SWE-bench Verified 88.6%). Best browser agent: Surfer 2 (WebVoyager 97.1%, H Company). This page tracks 16 agents across 8 verified benchmarks.


Agents that can actually drive your computer.

Twelve months ago no AI could operate a desktop better than a drunk intern — OSWorld scores sat under 15%. Today Claude Opus 4.7 leads OSWorld-Verified at 82.8%, and about ten models now clear the 72.4% human-expert baseline — with the open-source pack only a few points behind. The big story isn't one breakthrough — it's everyone shipping at once.

Claude Opus 4.7 · Anthropic · 82.8% on OSWorld-Verified · beats human

Claude Mythos Preview · Anthropic · 79.6% on OSWorld-Verified · beats human

GPT-5.5 · OpenAI · 78.7% on OSWorld-Verified · beats human

Human-expert baseline = 72.4%

This page tracks 16 agents across 4 architectural categories, scored on 8 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.

Agents tracked: 16

Benchmarks covered: 8

OSWorld-V SOTA: 82.8% (vs 72.4% human)

The 4 types of computer use


Screen-level OS control

5 agents


#	Agent	Description	Maker	Released	OSWorld-V	BrowseComp	Terminal-2	SWE-Pro
1	Claude Opus 4.7	Prior Anthropic flagship; OSWorld-Verified 82.8% after the zoom-tool fix and a 16K to 128K max-tokens-per-turn harness update.	Anthropic	2026-03	82.8	-	69.7	64.3
2	Claude Mythos Preview	Anthropic research preview; strong on deep browsing (BrowseComp 86.9%) and OSWorld-Verified 79.6%.	Anthropic	2026-04	79.6	86.9	-	-
3	Claude Sonnet 4.6	Anthropic's fast mid-tier model; sits right on the human OSWorld-Verified baseline at 72.1%.	Anthropic	2026-02	72.1	-	-	-
4	GPT-5.5	OpenAI's 2026 frontier model; OSWorld-Verified 78.7% and the Terminal-Bench 2.1 leader via Codex CLI (83.4%).	OpenAI	2026-05	78.7	-	-	58.6
5	Claude Sonnet 4.5	Sept 2025 Anthropic model; OSWorld-Verified 62.9%, a marker of how fast the frontier moved in 2026.	Anthropic	2025-09	62.9	-	-	-

Browser-only

2 agents


#	Agent	Description	Maker	Released	OSWorld-V	BrowseComp	Terminal-2	SWE-Pro
1	OpenAI Operator	OpenAI's original Computer-Using Agent (CUA). WebVoyager 87%, WebArena 58.1%.	OpenAI	2025-01	-	-	-	-
2	Project Mariner	Google DeepMind's research browser agent; multi-tab task automation built on Gemini.	Google DeepMind	2024-12	-	-	-	-

Coding-focused

9 agents


#	Agent	Description	Maker	Released	OSWorld-V	BrowseComp	Terminal-2	SWE-Pro
1	Claude Code	Anthropic's terminal-native coding agent. With Opus 4.8 it scores Terminal-Bench 2.1 78.9%, SWE-bench Pro 69.2%, SWE-bench Verified 88.6%.	Anthropic	2025-02	-	-	78.9	69.2
2	Codex CLI	OpenAI's terminal coding agent. With GPT-5.5 it leads Terminal-Bench 2.1 at 83.4%.	OpenAI	2025-04	-	-	83.4	-
3	Kimi K2.6 (OSS)	Moonshot's open agentic model; SWE-bench Verified 80.2%, SWE-bench Pro 58.6%, Terminal-Bench 2.0 66.7%. Sustains 4,000+ tool calls over 13-hour sessions.	Moonshot AI	2026-04	73.1	-	66.7	58.6
4	SWE-Agent (OSS)	The academic agent that defined the Agent-Computer Interface for fixing GitHub issues; the open baseline behind SWE-bench.	Princeton + Stanford	2024-04	-	-	-	-
5	Gemini CLI (OSS)	Google's open-source terminal agent. With Gemini 3.1 Pro it scores Terminal-Bench 2.1 70.7%.	Google	2025-06	-	-	70.7	-
6	Kimi K2.5 (OSS)	Moonshot's January 2026 open visual agentic model; OSWorld-Verified 63.3%.	Moonshot AI	2026-01	63.3	-	-	-
7	GLM-5.1 (OSS)	Z.ai's 754B-param MoE; SWE-bench Pro 58.4% and a 1,530 Code Arena Elo (3rd globally on agentic web development).	Z.ai	2026-04	-	-	-	58.4
8	OpenCode (OSS)	Free, model-agnostic open-source terminal coding agent, a community alternative to Claude Code and Codex CLI.	OpenCode	2025-06	-	-	-	-
9	Aider (OSS)	Popular open-source pair-programming agent in the terminal; edits across a git repo with any frontier model.	Aider	2023-05	-	-	-	-

Scores verified against OSWorld-Verified, BrowseComp, Steel.dev leaderboard, SWE-Bench, TheAgentCompany, and maker publications. Dash = not published.


Predict the next milestone

Who breaks 90% on OSWorld-Verified first?

The 72.4% human baseline already fell. The next round number, 90% on the verified split, would put agents 17.6pp above expert humans.

What each benchmark actually measures

With a 12-month SOTA trend so you can see if the curve is still climbing or has flattened.

BrowseComp

SOTA: 86.9% · Holder: Claude Mythos Preview · Human: 80% · +79pp over 12 months

1,266 hard browsing problems. Multi-hop, deep web research. Grounded factual answers, no LLM judge.

Terminal-Bench 2.1

SOTA: 83.4% · Holder: Codex CLI (GPT-5.5) · +64pp over 12 months

Held-out CLI tasks in real shells. Contamination-resistant successor to Terminal-Bench 1.

WebVoyager

SOTA: 97.1% · Holder: Surfer 2 · Human: 87% · +72pp over 12 months

643 real-world web tasks across Amazon, Booking, dictionaries. GPT-4V judge (criticized).

SWE-Bench Verified

SOTA: 95.0% · Holder: Claude Fable 5 · +66pp over 12 months

500 verified GitHub issues from 12 popular Python repos. Patches must pass repo tests.

SWE-Bench Pro

SOTA: 69.2% · Holder: Claude Opus 4.8 · +49pp over 12 months

Held-out, multi-language SWE-Bench successor. Contamination-resistant.

GDPval

SOTA: 47.0% · 1,320 tasks · +35pp over 12 months

44 occupations, blinded expert pairwise comparison of agent vs human deliverables.

How the benchmarks evolved — 2023 → 2026

The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.

2023 · WebArena + Mind2Web
First generation. DOM-only. Fully gamed by 2025.

Apr 2024 · OSWorld (v1)
XLANG Lab's first real-desktop VM benchmark. 369 tasks.

Jun 2024 · WebVoyager + GAIA
Web agents + general reasoning. GPT-4V as judge (later criticized).

Dec 2024 · TheAgentCompany
CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.

Apr 2025 · BrowseComp
OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.

Jul 2025 · OSWorld-Verified
XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.

Sep 2025 · GDPval
OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.

Oct 2025 · SWE-Bench Pro + Terminal-Bench 2.0
Held-out, contamination-resistant successors to the gamed originals.

Feb 2026 · Gemini 3.1 Pro + open GUI models
Google jumps on browsing (BrowseComp 85.9%); GUI-Owl-1.5 closes the open-source gap.

May 2026 · Claude Opus 4.8 takes the crown
OSWorld-Verified 83.4%, clearing the 72.4% human baseline by 11 points; SWE-bench Pro 69.2%.

2026 · Core triad consolidates
OSWorld-Verified + BrowseComp + Terminal-Bench 2.1 = weighted agentic score.

Every benchmark that matters — what each one actually measures

A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.

The 2026 core triad

BenchLM and agentic leaderboards now weight these three as the canonical agentic score.

BrowseComp

OpenAI's 1,266 hard browsing problems that reward research depth and factual grounding rather than shallow navigation.

SOTA: 86.9% · 1,266 tasks · Human: 80% · Leader: Claude Mythos Preview

Terminal-Bench 2.1

Held-out, contamination-resistant CLI tasks driven end-to-end in a real terminal. Version 2.1 is the 2026 standard for terminal autonomy.

SOTA: 83.4% · Leader: Codex CLI (GPT-5.5)
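The page does not say how the three triad benchmarks are weighted into one agentic score. As a minimal sketch, assuming equal weights (a hypothetical choice) and renormalizing when an agent has no published score on a benchmark, a composite might be computed as below. The function name, the dictionary keys, and the weights are illustrative and are not BenchLM's actual formula.

```python
from typing import Mapping, Optional

# Hypothetical equal weights; the real weighting is not published on this page.
TRIAD_WEIGHTS = {
    "osworld_verified": 1 / 3,
    "browsecomp": 1 / 3,
    "terminal_bench": 1 / 3,
}


def weighted_agentic_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = TRIAD_WEIGHTS,
) -> Optional[float]:
    """Weighted mean over the triad benchmarks that have a published score."""
    available = {k: v for k, v in scores.items() if v is not None and k in weights}
    total_weight = sum(weights[k] for k in available)
    if total_weight == 0:
        return None  # nothing published on any triad benchmark
    return sum(weights[k] * v for k, v in available.items()) / total_weight


# Claude Opus 4.7 row from the leaderboard: no BrowseComp score is published.
opus_47 = {"osworld_verified": 82.8, "browsecomp": None, "terminal_bench": 69.7}
print(round(weighted_agentic_score(opus_47), 2))
```

Renormalizing over the available benchmarks keeps agents with a missing score comparable, but it also lets an agent skip its weakest benchmark without penalty. A stricter leaderboard could treat a missing score as zero instead.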

Enterprise workflow

Realistic knowledge-worker tasks inside simulated companies, ServiceNow, professional domains.

GDPval

OpenAI's economic-impact eval across 44 occupations with blinded expert judging of real deliverables. Agents still lose most pairwise comparisons to human experts.

1,320 tasks

Browser-first

Web-scoped tasks. Online-Mind2Web and REAL use programmatic checkers instead of LLM judges.

WebVoyager

Live-website web tasks across 15 real sites. Largely saturated in 2026 - top agents exceed expected human accuracy.

SOTA: 97.1% · 643 tasks · Human: 87% · Leader: Surfer 2

WebArena

Self-hosted replica sites (shopping, forum, GitLab, CMS). First-generation but still cited; programmatically checked.

SOTA: 69.6% · 812 tasks · Human: 78% · Leader: Surfer 2

Specialized

Coding, mobile, GUI-grounding, and reasoning-heavy evaluations.

SWE-Bench Verified

OpenAI-verified 500-issue subset of SWE-Bench. Approaching saturation in 2026 - most frontier models clear 80%+.

SOTA: 95.0% · 500 tasks · Leader: Claude Fable 5

SWE-Bench Pro

Harder, contamination-resistant successor to SWE-Bench Verified: real GitHub issues with held-out tests. Where coding headroom remains.

SOTA: 69.2% · Leader: Claude Opus 4.8

Mind2Web

Original Mind2Web web-agent benchmark. 2,000+ tasks across 137 websites. Largely superseded by Online-Mind2Web (live evaluation) and Mind2Web 2 (deep research).

2,350 tasks

How they actually work

The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
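To make the modular split concrete, here is a minimal sketch of the five roles wired into one loop. Every class, field, and toy lambda below is a hypothetical stand-in chosen for illustration; it is not Agent S2's code or any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass
class Memory:
    """Append-only log of (step, observation) pairs shared across modules."""
    events: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class ModularAgent:
    planner: Callable[[str, Memory], List[str]]  # goal -> ordered steps
    grounder: Callable[[str], Point]             # step -> screen coordinates
    executor: Callable[[Point], str]             # coordinates -> observation
    verifier: Callable[[str, str], bool]         # (step, observation) -> success?
    memory: Memory = field(default_factory=Memory)

    def run(self, goal: str) -> bool:
        for step in self.planner(goal, self.memory):
            target = self.grounder(step)
            observation = self.executor(target)
            self.memory.events.append((step, observation))
            if not self.verifier(step, observation):
                return False  # a fuller agent would replan here instead
        return True


# Toy stand-ins so the sketch runs end to end without a model or a screen.
agent = ModularAgent(
    planner=lambda goal, mem: ["open menu", "click save"],
    grounder=lambda step: {"open menu": (10, 10), "click save": (40, 80)}[step],
    executor=lambda xy: f"clicked {xy}",
    verifier=lambda step, obs: obs.startswith("clicked"),
)
print(agent.run("save the document"))
```

Keeping each role behind a plain callable means a grounder or verifier can be swapped for a different model without touching the loop.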

Perception

Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.

Action grounding

Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
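The difference between JSON tool-calling and code-as-action can be sketched in a few lines. The tool names, the message format, and both runner functions are assumptions made for illustration; real agents use their own schemas and run generated code inside a proper sandbox.

```python
import json
from typing import Callable, Dict, List


def click(x: int, y: int) -> str:
    return f"click({x}, {y})"


def type_text(text: str) -> str:
    return f"type({text!r})"


TOOLS: Dict[str, Callable[..., str]] = {"click": click, "type_text": type_text}


def run_json_call(message: str) -> str:
    """JSON tool-calling: one tool invocation per model message."""
    call = json.loads(message)
    return TOOLS[call["tool"]](**call["args"])


def run_code_action(source: str) -> List[str]:
    """Code-as-action: the model emits a snippet that can loop and compose tools."""
    log: List[str] = []

    def recording(tool: Callable[..., str]) -> Callable[..., None]:
        return lambda *args, **kwargs: log.append(tool(*args, **kwargs))

    namespace = {name: recording(tool) for name, tool in TOOLS.items()}
    # Trimming builtins keeps the demo small; it is NOT a security sandbox.
    namespace["__builtins__"] = {"range": range}
    exec(source, namespace)
    return log


print(run_json_call('{"tool": "click", "args": {"x": 5, "y": 9}}'))
snippet = "for row in range(3):\n    click(100, 20 * row)\ntype_text('done')"
print(run_code_action(snippet))
```

The snippet performs four actions in a single model turn, where the JSON path would need four round trips. That is one plausible reason code actions do better on long, repetitive tasks.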

Planning

ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
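The DAG idea can be illustrated with the standard library alone: independent tool calls run in parallel, and a dependent step waits for its inputs. This is a sketch in the spirit of LLMCompiler, not its implementation; the task names and the `run_dag` helper are hypothetical, and it makes no attempt to reproduce the 3.6× figure.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_dag(
    tasks: Dict[str, Callable[[], str]],
    deps: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Run each task as soon as all of its dependencies have finished."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    results: Dict[str, str] = {}
    running: Dict[Future, str] = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # unlocks tasks that depended on this one
    return results


tasks = {
    "search_flights": lambda: "3 flights",
    "search_hotels": lambda: "5 hotels",
    "summarize": lambda: "itinerary",
}
deps = {
    "search_flights": set(),
    "search_hotels": set(),
    "summarize": {"search_flights", "search_hotels"},
}
print(run_dag(tasks, deps))
```

The two searches are submitted together, and `summarize` is only released once both have reported done.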

Sandboxing

Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.

Error recovery

Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
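Screenshot diffing and retry-with-variation combine naturally: if an action leaves the screen unchanged, try a variant. The sketch below assumes screenshots are flattened lists of grayscale pixels and uses a hypothetical 1% change threshold; a real agent would compare image arrays and tune that value.

```python
from typing import Callable, Iterable, Optional, Sequence

Screenshot = Sequence[int]  # flattened grayscale pixels, a stand-in for an image


def changed_fraction(before: Screenshot, after: Screenshot) -> float:
    """Share of pixels that differ between two screenshots."""
    if len(before) != len(after):
        return 1.0  # treat a resize as a full change
    if not before:
        return 0.0
    return sum(a != b for a, b in zip(before, after)) / len(before)


def retry_with_variation(
    attempts: Iterable[Callable[[], Screenshot]],
    before: Screenshot,
    min_change: float = 0.01,  # hypothetical threshold
) -> Optional[int]:
    """Try each variant in order; return the index of the first that changes the screen."""
    for index, attempt in enumerate(attempts):
        after = attempt()
        if changed_fraction(before, after) >= min_change:
            return index
    return None


blank = [0] * 100
variants = [
    lambda: blank,                   # first click missed, nothing changed
    lambda: [0] * 90 + [255] * 10,   # offset click opened a dialog
]
print(retry_with_variation(variants, blank))
```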

UI grounding

Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed ScreenSpot-Pro SOTA to 48.7% at the time.
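Grounding is commonly scored by checking whether the predicted click point lands inside the target element's bounding box. The sketch below assumes that scoring rule; the `Box` class, the coordinates, and the two example predictions are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Point:
        """A natural click target once the element has been located."""
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(pairs: Iterable[Tuple[Point, Box]]) -> float:
    """Fraction of predicted click points that land inside the target element."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(box.contains(point) for point, box in pairs) / len(pairs)


buy_button = Box(left=400, top=300, right=480, bottom=330)
predictions = [((440, 315), buy_button), ((390, 315), buy_button)]
print(buy_button.center(), grounding_accuracy(predictions))
```

The second prediction misses by ten pixels and counts as a full failure, which is why small targets on dense screens are so punishing.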

Still unsolved — the safety ceiling

Prompt injection: OpenAI admitted in December 2025 that it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
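The economics and speed figures above reduce to simple arithmetic. This sketch only restates the numbers quoted in the list (10,000 extractions per day at $50 to $200, and 2 to 5 seconds per action); the helper names and the 30-day month are assumptions.

```python
def cost_per_extraction(daily_cost_usd: float, extractions_per_day: int) -> float:
    """Average LLM fee for a single extraction."""
    return daily_cost_usd / extractions_per_day


def actions_per_hour(seconds_per_action: float) -> float:
    """Upper bound on sequential actions for one agent in an hour."""
    return 3600 / seconds_per_action


low = cost_per_extraction(50, 10_000)
high = cost_per_extraction(200, 10_000)
print(f"${low:.3f} to ${high:.3f} per extraction")
print(f"{actions_per_hour(5):.0f} to {actions_per_hour(2):.0f} actions per hour")
print(f"${50 * 30:,} to ${200 * 30:,} per 30-day month")
```

Half a cent to two cents per extraction looks small in isolation, but it adds up to $1,500 to $6,000 a month at that volume, against no per-call fee for a deterministic script.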

Frequently asked

Q1. What is a Computer Use agent?

A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).

Q2. What is the current OSWorld SOTA in 2026?

As of June 2026, the OSWorld-Verified leaderboard is led by Claude Opus 4.8 (Anthropic) at 83.4%, ahead of Claude Opus 4.7 (82.8%) and H Company’s Holo3-35B-A3B (82.6%). Roughly ten models now clear the 72.4% human-expert baseline, including GPT-5.4/5.5, Gemini 3.1 Pro and Claude Mythos Preview. The strongest open-source model is Qwen3-VL-235B at 66.7%.

Q3. Is OSWorld still the right benchmark, or is it outdated?

OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and made evaluation more robust. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.1) that weighted agentic leaderboards use. That said, it's only one dimension: enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.

Q4. How is OSWorld-Verified different from the original OSWorld?

OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.

Q5. What other benchmarks exist besides OSWorld?

The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.1 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.

Q6. Can any agent beat a human on these benchmarks?

On OSWorld-Verified: yes — about ten models now clear the 72.4% human baseline, led by Claude Opus 4.8 at 83.4%. On WebVoyager: yes, Surfer 2 at 97.1% is above expected human accuracy. On BrowseComp: Claude Mythos Preview (86.9%) and Gemini 3.1 Pro (85.9%) are near the ~80% human-with-internet level. On SWE-Bench Verified: top models pass 88%+ of real GitHub issues, and Claude Fable 5 reaches 95%. On AndroidWorld: Surfer 2 leads at 87.1%, now above the ~80% human baseline. On GDPval: agents still lose blind expert pairwise comparisons the majority of the time.

Q7. What's the safety picture in 2026?

Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.

Q8. Which agents are open-source?

The strongest open-source options in 2026: Qwen3-VL-235B (Alibaba, 66.7% OSWorld-Verified, the open SOTA), Kimi K2.5 (Moonshot, 63.3% OSWorld-Verified) and Kimi K2.6 (open coding/agent leader: SWE-bench Pro 58.6%, Terminal-Bench 2.0 66.7%), GUI-Owl-1.5 (Alibaba, 52.9% OSWorld-Verified at 8B, ScreenSpot-Pro 80.3%), UI-TARS-2 (ByteDance, ~53% OSWorld-Verified, Online-Mind2Web 88.2%), GLM-5.1 (Z.ai, SWE-bench Pro 58.4%), plus the agent frameworks OpenHands, Browser Use (WebVoyager 89%), Magentic-One and Playwright MCP.

Editor's take — June 19, 2026

2026 is the year computer use stopped being a demo and started being a line item. Winners today: Anthropic on OSWorld-Verified (Claude Opus 4.8, 83.4%) and SWE-bench Pro (69.2%), OpenAI on the terminal (Codex CLI + GPT-5.5 lead Terminal-Bench 2.1 at 83.4%), Gemini 3.1 Pro on agentic browsing (BrowseComp 85.9%), H Company's Surfer 2 on WebVoyager (97.1%), and OpenHands for open-source coding. Still to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source pack (Qwen3-VL-235B, Kimi K2.6, GUI-Owl-1.5) now trails the proprietary frontier on OSWorld-Verified by only a few points.

Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed June 19, 2026.
