gentic.news — AI News Intelligence Platform
Agents Lab

Autonomous agents that control computers — OSWorld, BrowseComp & Terminal-Bench.

Quick Answer · Updated June 19, 2026

Current OSWorld-Verified SOTA: Claude Opus 4.7 from Anthropic at 82.8% (June 2026) — comfortably past the 72.4% human-expert baseline. Strongest open-source: Qwen3-VL-235B (Alibaba) at 66.7% on OSWorld-Verified. Best coding agent: Claude Opus 4.8 (SWE-bench Pro 69.2%, SWE-bench Verified 88.6%). Best browser agent: Surfer 2 (WebVoyager 97.1%, H Company). This page tracks 16 agents across 8 verified benchmarks.


Agents that can actually drive your computer.

Twelve months ago no AI could operate a desktop better than a drunk intern — OSWorld scores sat under 15%. Today Claude Opus 4.7 leads OSWorld-Verified at 82.8%, and about ten models now clear the 72.4% human-expert baseline — with the open-source pack only a few points behind. The big story isn't one breakthrough — it's everyone shipping at once.

  1. Claude Opus 4.7 (Anthropic): 82.8% on OSWorld-Verified, beats human
  2. Claude Mythos Preview (Anthropic): 79.6% on OSWorld-Verified, beats human
  3. GPT-5.5 (OpenAI): 78.7% on OSWorld-Verified, beats human

Human-expert baseline = 72.4%

This page tracks 16 agents across 4 architectural categories, scored on 8 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.

16 agents tracked · 8 benchmarks covered · 7 open-source options · 82.8% OSWorld-V SOTA (vs 72.4% human)

The 4 types of computer use


All categories: 16 agents
Screen-level OS control: 5 agents
Browser-only: 2 agents
Coding-focused: 9 agents

Scores verified against OSWorld-Verified, BrowseComp, Steel.dev leaderboard, SWE-Bench, TheAgentCompany, and maker publications. Dash = not published.

Predict the next milestone

Who breaks 90% on OSWorld-Verified first?

The 72.4% human baseline already fell. The next round number, 90% on the verified split, would put agents 17.6pp above expert humans.


What each benchmark actually measures

With a 12-month SOTA trend so you can see if the curve is still climbing or has flattened.

BrowseComp

SOTA: 86.9%

1,266 hard browsing problems. Multi-hop, deep web research. Grounded factual answers, no LLM judge.

12-month trend: +79pp · Human: 80% · Holder: Claude Mythos Preview · 1,266 tasks

Terminal-Bench 2.1

SOTA: 83.4%

Held-out CLI tasks in real shells. Contamination-resistant successor to Terminal-Bench 1.

12-month trend: +64pp · Holder: Codex CLI (GPT-5.5)

WebVoyager

SOTA: 97.1%

643 real-world web tasks across Amazon, Booking, dictionaries. GPT-4V judge (criticized).

12-month trend: +72pp · Human: 87% · Holder: Surfer 2 · 643 tasks

SWE-Bench Verified

SOTA: 95.0%

500 verified GitHub issues from 12 popular Python repos. Patches must pass repo tests.

12-month trend: +66pp · Holder: Claude Fable 5 · 500 tasks

SWE-Bench Pro

SOTA: 69.2%

Held-out, multi-language SWE-Bench successor. Contamination-resistant.

12-month trend: +49pp · Holder: Claude Opus 4.8

GDPval

SOTA: 47.0%

44 occupations, blinded expert pairwise comparison of agent vs human deliverables.

12-month trend: +35pp · 1,320 tasks
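The 12-month gains quoted on the cards above are percentage-point (pp) deltas, so the score implied a year earlier is simply the current SOTA minus the gain. A minimal sketch using only the numbers on the cards; the names `CARDS` and `implied_prior_sota` are illustrative, and it assumes each delta is measured within the same benchmark family.

```python
# Current SOTA (%) and 12-month gain (percentage points), copied from the cards above.
CARDS = {
    "BrowseComp": (86.9, 79),
    "Terminal-Bench 2.1": (83.4, 64),
    "WebVoyager": (97.1, 72),
    "SWE-Bench Verified": (95.0, 66),
    "SWE-Bench Pro": (69.2, 49),
    "GDPval": (47.0, 35),
}


def implied_prior_sota(current: float, gain_pp: float) -> float:
    """Score implied 12 months earlier: a percentage-point gain is a plain difference."""
    return round(current - gain_pp, 1)


for name, (current, gain_pp) in CARDS.items():
    print(f"{name}: {implied_prior_sota(current, gain_pp)}% -> {current}%")
```

A percentage-point gain is not a relative improvement: going from 7.9% to 86.9% is +79pp, but roughly an elevenfold increase in solved tasks.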

How the benchmarks evolved — 2023 → 2026

The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.

  1. 2023: WebArena + Mind2Web

    First generation. DOM-only. Fully gamed by 2025.

  2. Apr 2024: OSWorld (v1)

    XLANG Lab's first real-desktop VM benchmark. 369 tasks.

  3. Jun 2024: WebVoyager + GAIA

    Web agents + general reasoning. GPT-4V as judge (later criticized).

  4. Dec 2024: TheAgentCompany

    CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.

  5. Apr 2025: BrowseComp

    OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.

  6. Jul 2025: OSWorld-Verified

    XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.

  7. Sep 2025: GDPval

    OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.

  8. Oct 2025: SWE-Bench Pro + Terminal-Bench 2.0

    Held-out, contamination-resistant successors to the gamed originals.

  9. Feb 2026: Gemini 3.1 Pro + open GUI models

    Google jumps on browsing (BrowseComp 85.9%); GUI-Owl-1.5 closes the open-source gap.

  10. May 2026: Claude Opus 4.8 takes the crown

    OSWorld-Verified 83.4%, clearing the 72.4% human baseline by 11 points; SWE-bench Pro 69.2%.

  11. 2026: Core triad consolidates

    OSWorld-Verified + BrowseComp + Terminal-Bench 2.1 = weighted agentic score.

Every benchmark that matters — what each one actually measures

A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.

The 2026 core triad

BenchLM and agentic leaderboards now weight these three as the canonical agentic score.

BrowseComp

OpenAI's 1,266 hard browsing problems that reward research depth and factual grounding rather than shallow navigation.

SOTA: 86.9% · 1,266 tasks · Human: 80% · Leader: Claude Mythos Preview

Terminal-Bench 2.1

Held-out, contamination-resistant CLI tasks driven end-to-end in a real terminal. Version 2.1 is the 2026 standard for terminal autonomy.

SOTA: 83.4% · Leader: Codex CLI (GPT-5.5)
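The page says leaderboards weight the triad (OSWorld-Verified, BrowseComp, Terminal-Bench 2.1) into one agentic score but does not publish the weights. As a sketch only, the snippet below assumes equal weights and a made-up agent; neither the weighting nor the example scores come from BenchLM or any leaderboard.

```python
from typing import Mapping

# ASSUMPTION: equal weights. The source does not state the real weighting.
TRIAD_WEIGHTS = {
    "OSWorld-Verified": 1 / 3,
    "BrowseComp": 1 / 3,
    "Terminal-Bench 2.1": 1 / 3,
}


def weighted_agentic_score(scores: Mapping[str, float],
                           weights: Mapping[str, float] = TRIAD_WEIGHTS) -> float:
    """Weighted mean of per-benchmark scores (0-100); raises if a benchmark is missing."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical agent; these three numbers are invented for illustration.
example = {"OSWorld-Verified": 80.0, "BrowseComp": 70.0, "Terminal-Bench 2.1": 60.0}
print(round(weighted_agentic_score(example), 1))
```

Refusing to score an agent with a missing benchmark is a deliberate choice here: silently averaging over whatever is published would reward agents that skip their weakest benchmark.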

Enterprise workflow

Realistic knowledge-worker tasks inside simulated companies, ServiceNow, professional domains.

GDPval

OpenAI's economic-impact eval across 44 occupations with blinded expert judging of real deliverables. Agents still lose most pairwise comparisons to human experts.

1,320 tasks

Browser-first

Web-scoped tasks. Online-Mind2Web and REAL use programmatic checkers instead of LLM judges.

WebVoyager

Live-website web tasks across 15 real sites. Largely saturated in 2026 - top agents exceed expected human accuracy.

SOTA: 97.1% · 643 tasks · Human: 87% · Leader: Surfer 2

WebArena

Self-hosted replica sites (shopping, forum, GitLab, CMS). First-generation but still cited; programmatically checked.

SOTA: 69.6% · 812 tasks · Human: 78% · Leader: Surfer 2

Specialized

Coding, mobile, GUI-grounding, and reasoning-heavy evaluations.

SWE-Bench Verified

OpenAI-verified 500-issue subset of SWE-Bench. Approaching saturation in 2026 - most frontier models clear 80%+.

SOTA: 95.0% · 500 tasks · Leader: Claude Fable 5

SWE-Bench Pro

Harder, contamination-resistant successor to SWE-Bench Verified: real GitHub issues with held-out tests. Where coding headroom remains.

SOTA: 69.2% · Leader: Claude Opus 4.8

Mind2Web

Original Mind2Web web-agent benchmark. 2,000+ tasks across 137 websites. Largely superseded by Online-Mind2Web (live evaluation) and Mind2Web 2 (deep research).

2,350 tasks

How they actually work

The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
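To make the modular stack concrete, here is a toy sketch of how Planner, Grounder, Executor, Memory, and Verifier can be wired together. It is not Agent S2's code: every class, function, and the dict-based "screen" below is hypothetical, and a real grounder would emit pixel coordinates rather than element names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str        # "click" or "type"
    target: str      # UI element name (stand-in for coordinates)
    text: str = ""


@dataclass
class Agent:
    planner: Callable[[str], List[str]]     # goal -> ordered sub-steps
    grounder: Callable[[str], Action]       # sub-step -> concrete action
    executor: Callable[[Action], None]      # performs the action
    verifier: Callable[[str], bool]         # did the sub-step succeed?
    memory: List[Tuple[str, bool]] = field(default_factory=list)

    def run(self, goal: str, max_retries: int = 1) -> bool:
        for step in self.planner(goal):
            for _attempt in range(max_retries + 1):
                self.executor(self.grounder(step))
                ok = self.verifier(step)
                self.memory.append((step, ok))   # record every attempt
                if ok:
                    break
            else:
                return False                     # step never verified
        return True


# Toy environment: a dict standing in for the screen state.
ui = {"query": "", "submitted": False}


def plan(goal: str) -> List[str]:
    return ["type query", "press search"]


def ground(step: str) -> Action:
    if step == "type query":
        return Action("type", "query", "osworld")
    return Action("click", "search")


def execute(action: Action) -> None:
    if action.kind == "type":
        ui[action.target] = action.text
    else:
        ui["submitted"] = True


def verify(step: str) -> bool:
    return ui["query"] == "osworld" if step == "type query" else ui["submitted"]


agent = Agent(plan, ground, execute, verify)
print(agent.run("search for osworld"), agent.memory)
```

Because each module is just a callable, any one of them can be swapped (a stronger grounder, a stricter verifier) without touching the loop, which is the practical appeal of the split.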

Perception

Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.

Action grounding

Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
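The difference between JSON tool-calling and Code-as-Action is easiest to see side by side. In this sketch the tool `get_price`, its price table, and the model outputs are all invented; the point is only that a code action can loop and combine results in one turn where JSON calling needs one turn per call.

```python
import json


def get_price(item: str) -> float:
    """Hypothetical tool: look up a price."""
    return {"apple": 0.5, "pear": 0.75, "fig": 2.0}[item]


TOOLS = {"get_price": get_price}


# JSON tool-calling: the model emits one call per turn and the host dispatches it.
def run_json_call(message: str) -> float:
    call = json.loads(message)
    return TOOLS[call["name"]](**call["arguments"])


json_turns = [
    '{"name": "get_price", "arguments": {"item": "apple"}}',
    '{"name": "get_price", "arguments": {"item": "pear"}}',
    '{"name": "get_price", "arguments": {"item": "fig"}}',
]
json_total = sum(run_json_call(m) for m in json_turns)   # three model turns

# Code-as-Action: the model emits a snippet that composes tools with loops and variables.
model_code = """
total = sum(get_price(item) for item in ["apple", "pear", "fig"])
"""


def run_code_action(code: str) -> dict:
    # Only the tools and `sum` are exposed. NOTE: exec() is not a security
    # boundary; real systems run model-written code inside a sandbox.
    namespace = {"__builtins__": {"sum": sum}, **TOOLS}
    exec(code, namespace)
    return namespace


code_total = run_code_action(model_code)["total"]        # one model turn
print(json_total, code_total)
```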

Planning

ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
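The speedup from DAG-style planning comes from running independent steps concurrently instead of one ReAct step at a time. The sketch below shows only that scheduling idea with `asyncio`; it is not LLMCompiler's implementation, and the plan, task names, and latencies are made up.

```python
import asyncio

# Hypothetical plan: task name -> (dependencies, simulated latency in seconds).
PLAN = {
    "open_site_a": ((), 0.05),
    "open_site_b": ((), 0.05),
    "open_site_c": ((), 0.05),
    "compare": (("open_site_a", "open_site_b", "open_site_c"), 0.05),
}


async def run_plan(plan):
    """Start every task as soon as its dependencies finish; return completion order."""
    done_events = {name: asyncio.Event() for name in plan}
    order = []

    async def run_task(name):
        deps, latency = plan[name]
        for dep in deps:
            await done_events[dep].wait()
        await asyncio.sleep(latency)       # stand-in for a tool or model call
        order.append(name)
        done_events[name].set()

    await asyncio.gather(*(run_task(name) for name in plan))
    return order


order = asyncio.run(run_plan(PLAN))
print(order)
```

Run strictly in sequence, these four steps cost four latency units; scheduled as a DAG, the three page loads overlap and the whole plan costs about two. The real-world gain depends entirely on how much of a task is actually independent.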

Sandboxing

Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.
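Per-session Docker isolation boils down to launching a throwaway, locked-down container for each agent run. The sketch below only builds the `docker run` command and prints it; the resource limits, naming scheme, and function name are illustrative assumptions, not how OpenHands or any other product configures its sandbox.

```python
import shlex
import uuid
from typing import List, Optional


def docker_run_argv(image: str, command: List[str],
                    session_id: Optional[str] = None) -> List[str]:
    """Build (but do not run) a `docker run` command for one throwaway agent session."""
    session_id = session_id or uuid.uuid4().hex[:12]
    return [
        "docker", "run",
        "--rm",                                  # delete the container when the session ends
        "--name", f"agent-session-{session_id}",
        "--network", "none",                     # no network unless the task needs it
        "--memory", "2g",                        # illustrative resource caps
        "--cpus", "2",
        "--pids-limit", "256",
        "--read-only",                           # immutable root filesystem
        "--tmpfs", "/tmp",                       # scratch space that dies with the container
        image, *command,
    ]


argv = docker_run_argv(
    "python:3.12-slim",
    ["python", "-c", "print('hello from the sandbox')"],
    session_id="demo",
)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start the container. A browsing agent obviously cannot use `--network none`, so real deployments trade that flag for an egress allowlist.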

Error recovery

Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
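Screenshot diffing and retry-with-variation combine naturally: act, check whether the screen changed, and if it did not, try a slightly different action. The following is a toy sketch under heavy assumptions (frames are small integer grids, the "button" is a hard-coded hit-box, and all names are hypothetical); it does not reflect Stagehand's internals.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Screenshot = List[List[int]]   # toy grayscale frame


def changed_fraction(before: Screenshot, after: Screenshot) -> float:
    """Share of pixels that differ between two same-sized frames."""
    total = sum(len(row) for row in before)
    diff = sum(a != b for rb, ra in zip(before, after) for b, a in zip(rb, ra))
    return diff / total


def act_with_recovery(capture: Callable[[], Screenshot],
                      click: Callable[[int, int], None],
                      variations: Iterable[Tuple[int, int]],
                      min_change: float = 0.01) -> Optional[Tuple[int, int]]:
    """Try each click variation until the screen visibly changes; return the one that worked."""
    for x, y in variations:
        before = capture()
        click(x, y)
        if changed_fraction(before, capture()) >= min_change:
            return (x, y)
    return None


# Toy environment: a 10x10 screen whose button only reacts inside a small hit-box.
screen = [[0] * 10 for _ in range(10)]


def capture() -> Screenshot:
    return [row[:] for row in screen]


def click(x: int, y: int) -> None:
    if 4 <= x <= 6 and 4 <= y <= 6:          # hypothetical button hit-box
        for row in screen:
            row[:] = [1] * len(row)           # the page "navigates"


hit = act_with_recovery(capture, click, [(1, 1), (3, 4), (5, 5)])
print(hit)
```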

UI grounding

Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
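Cascaded search attacks the grounding problem by narrowing the screen region step by step instead of predicting coordinates in one shot. Below is a coarse-to-fine sketch of that general idea, not ScreenSeekeR's actual algorithm: the quadrant split, the `min_size` stop rule, and the oracle scorer are assumptions, and in practice the scorer would be a vision-model call on each crop.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # left, top, right, bottom


def quadrants(box: Box) -> List[Box]:
    """Split a box into four equal sub-regions."""
    left, top, right, bottom = box
    mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y), (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom), (mid_x, mid_y, right, bottom),
    ]


def cascaded_ground(score: Callable[[Box], float], screen: Box,
                    min_size: int = 32) -> Tuple[int, int]:
    """Repeatedly zoom into the best-scoring quadrant, then click its centre."""
    box = screen
    while (box[2] - box[0]) > min_size or (box[3] - box[1]) > min_size:
        box = max(quadrants(box), key=score)
    return ((box[0] + box[2]) // 2, (box[1] + box[3]) // 2)


# Toy scorer: an oracle that knows where the "blue Buy button" is.
TARGET = (812, 431)   # hypothetical button location


def oracle_score(box: Box) -> float:
    left, top, right, bottom = box
    return 1.0 if left <= TARGET[0] < right and top <= TARGET[1] < bottom else 0.0


x, y = cascaded_ground(oracle_score, (0, 0, 1920, 1080))
print(x, y)
```

Each zoom costs one extra scoring pass, so the cascade buys accuracy on small targets at the price of latency.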

Still unsolved — the safety ceiling

  • Prompt injection: OpenAI admitted December 2025 it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
  • Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
  • Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
  • Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
  • Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
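The economics and speed bullets above reduce to simple arithmetic. The sketch below uses only the figures quoted in those bullets; the 30-day month, the 50-step task, and the 8-hour day are assumptions added for illustration.

```python
EXTRACTIONS_PER_DAY = 10_000
DAILY_FEES_USD = (50.0, 200.0)       # low and high estimates from the text
SECONDS_PER_ACTION = (2.0, 5.0)      # low and high estimates from the text
DEVIN_USD_PER_HOUR = 9.0             # "~$9 on Core" from the text

# Cost per single extraction, low and high.
per_extraction = tuple(fee / EXTRACTIONS_PER_DAY for fee in DAILY_FEES_USD)

# ASSUMPTION: a 30-day month.
monthly_fees = tuple(fee * 30 for fee in DAILY_FEES_USD)

# ASSUMPTION: a hypothetical 50-action task.
STEPS = 50
task_seconds = tuple(s * STEPS for s in SECONDS_PER_ACTION)

# ASSUMPTION: one agent running for an 8-hour working day.
devin_per_day = DEVIN_USD_PER_HOUR * 8

print(f"per extraction: ${per_extraction[0]:.3f} to ${per_extraction[1]:.3f}")
print(f"per month:      ${monthly_fees[0]:,.0f} to ${monthly_fees[1]:,.0f}")
print(f"50-step task:   {task_seconds[0]:.0f}s to {task_seconds[1]:.0f}s")
print(f"8h of Devin:    ~${devin_per_day:.0f}")
```

Half a cent to two cents per extraction sounds negligible until it is multiplied by volume, which is why deterministic Playwright scripts still win wherever the page layout is stable.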

Frequently asked

Q1. What is a Computer Use agent?

A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).

Q2. What is the current OSWorld SOTA in 2026?

As of June 2026, the OSWorld-Verified leaderboard is led by Claude Opus 4.8 (Anthropic) at 83.4%, ahead of Claude Opus 4.7 (82.8%) and H Company’s Holo3-35B-A3B (82.6%). Roughly ten models now clear the 72.4% human-expert baseline, including GPT-5.4/5.5, Gemini 3.1 Pro and Claude Mythos Preview. The strongest open-source model is Qwen3-VL-235B at 66.7%.

Q3. Is OSWorld still the right benchmark, or is it outdated?

OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and improved evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.1) that weighted agentic leaderboards use. That said, it covers only one dimension: enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.

Q4. How is OSWorld-Verified different from the original OSWorld?

OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.

Q5. What other benchmarks exist besides OSWorld?

The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.1 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.

Q6. Can any agent beat a human on these benchmarks?

On OSWorld-Verified: yes — about ten models now clear the 72.4% human baseline, led by Claude Opus 4.8 at 83.4%. On WebVoyager: yes, Surfer 2 at 97.1% is above expected human accuracy. On BrowseComp: Claude Mythos Preview (86.9%) and Gemini 3.1 Pro (85.9%) are near the ~80% human-with-internet level. On SWE-Bench Verified: top models pass 88%+ of real GitHub issues, and Claude Fable 5 reaches 95%. On AndroidWorld: Surfer 2 leads at 87.1%, now above the ~80% human baseline. On GDPval: agents still lose blind expert pairwise comparisons the majority of the time.

Q7. What's the safety picture in 2026?

Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.

Q8. Which agents are open-source?

The strongest open-source options in 2026: Qwen3-VL-235B (Alibaba, 66.7% OSWorld-Verified, the open SOTA), Kimi K2.5 (Moonshot, 63.3% OSWorld-Verified) and Kimi K2.6 (open coding/agent leader: SWE-bench Pro 58.6%, Terminal-Bench 2.0 66.7%), GUI-Owl-1.5 (Alibaba, 52.9% OSWorld-Verified at 8B, ScreenSpot-Pro 80.3%), UI-TARS-2 (ByteDance, ~53% OSWorld-Verified, Online-Mind2Web 88.2%), GLM-5.1 (Z.ai, SWE-bench Pro 58.4%), plus the agent frameworks OpenHands, Browser Use (WebVoyager 89%), Magentic-One and Playwright MCP.

Editor's take — June 19, 2026

2026 is the year computer use stopped being a demo and started being a line item. Winners today: Anthropic on OSWorld-Verified (Claude Opus 4.8, 83.4%) and SWE-bench Pro (69.2%), OpenAI on the terminal (Codex CLI + GPT-5.5 lead Terminal-Bench 2.1 at 83.4%), Gemini 3.1 Pro on agentic browsing (BrowseComp 85.9%), H Company's Surfer 2 on WebVoyager (97.1%), and OpenHands for open-source coding. Yet to ship: Meta, Apple, Amazon (no first-party computer-use agent product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source pack (Qwen3-VL-235B, Kimi K2.6, GUI-Owl-1.5) now trails the proprietary frontier on OSWorld-Verified by only a few points.

Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed June 19, 2026.