gentic.news — AI News Intelligence Platform
Agents Lab

Autonomous agents that control computers — OSWorld, BrowseComp & Terminal-Bench.

Quick Answer · Updated April 24, 2026

Current OSWorld-Verified SOTA: Kimi K2.6 from Moonshot AI at 73.1% (April 2026), the first model to cleanly beat the 72.4% human-expert baseline. Strongest open-source: Kimi K2.6 (Moonshot AI) at 73.1%. Best coding agent: Claude Opus 4.7 (SWE-Bench Pro 64.3%, SWE-Bench Verified 87.6%). Best browser agent: Surfer 2 (WebVoyager 97.1%, H Company). This page tracks 7 agents across 8 verified benchmarks.


Agents that can actually drive your computer.

Twelve months ago no AI could operate a desktop better than a drunk intern — OSWorld scores sat under 15%. Today Kimi K2.6 leads OSWorld-Verified at 73.1%, clearing the 72.4% human-expert baseline. Two models now beat humans on OSWorld, three on WebVoyager, a dozen on SWE-Bench Verified. The big story isn't one breakthrough — it's everyone shipping at once.

  1. Kimi K2.6 (Moonshot AI, open-source): 73.1% on OSWorld-Verified, beats the human baseline
  2. 62.9% on OSWorld-Verified

Human-expert baseline = 72.4%

This page tracks 7 agents across 4 architectural categories, scored on 8 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.

7 agents tracked · 8 benchmarks covered · 4 open-source options · 73.1% OSWorld-V SOTA (vs 72.4% human)

The 4 types of computer use

7 agents in total:

  • Screen-level OS control: 2 agents
  • Browser-only: 2 agents
  • Sandboxed VM / container: 1 agent
  • Coding-focused: 2 agents

Scores verified against OSWorld-Verified, BrowseComp, Steel.dev leaderboard, SWE-Bench, TheAgentCompany, and maker publications. Dash = not published.

Predict the next milestone

Who breaks 90% on OSWorld-Verified first?

The 72.4% human baseline already fell. The next round number, 90% on the verified split, would put agents 17.6pp above expert humans.


What each benchmark actually measures

With a 12-month SOTA trend so you can see if the curve is still climbing or has flattened.

BrowseComp

SOTA: 86.9%

1,266 hard browsing problems. Multi-hop, deep web research. Grounded factual answers, no LLM judge.

Trend: +79pp over 12 months · Human: 80% · Holder: Claude Mythos Preview · 1,266 tasks

Terminal-Bench 2.0

SOTA: 92.1%

Held-out CLI tasks in real shells. Contamination-resistant successor to Terminal-Bench 1.

Trend: +64pp over 12 months · Holder: Claude Mythos

WebVoyager

SOTA: 97.1%

643 real-world web tasks across Amazon, Booking, dictionaries. GPT-4V judge (criticized).

Trend: +72pp over 12 months · Holder: Surfer 2 · 643 tasks

SWE-Bench Verified

SOTA: 87.6%

500 verified GitHub issues from 12 popular Python repos. Patches must pass repo tests.

Trend: +66pp over 12 months · Holder: Claude Opus 4.7 · 500 tasks

SWE-Bench Pro

SOTA: 64.3%

Held-out, multi-language SWE-Bench successor. Contamination-resistant.

Trend: +49pp over 12 months · Holder: Claude Opus 4.7 · 731 tasks

GDPval

SOTA: 47.6%

44 occupations, blinded expert pairwise comparison of agent vs human deliverables.

Trend: +35pp over 12 months · Holder: GPT-5.4 · 220 tasks

How the benchmarks evolved — 2023 → 2026

The benchmark landscape changed faster than the models did. OSWorld v1 was April 2024. By July 2025 it had been replaced by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.

  1. 2023: WebArena + Mind2Web

    First generation. DOM-only. Fully gamed by 2025.

  2. Apr 2024: OSWorld (v1)

    XLANG Lab's first real-desktop VM benchmark. 369 tasks.

  3. Jun 2024: WebVoyager + GAIA

    Web agents + general reasoning. GPT-4V as judge (later criticized).

  4. Dec 2024: TheAgentCompany

    CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.

  5. Apr 2025: BrowseComp

    OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.

  6. Jul 2025: OSWorld-Verified

    XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.

  7. Sep 2025: GDPval

    OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.

  8. Oct 2025: SWE-Bench Pro + Terminal-Bench 2.0

    Held-out, contamination-resistant successors to the gamed originals.

  9. 2026: Core triad consolidates

    OSWorld-Verified + BrowseComp + Terminal-Bench 2.0 = weighted agentic score.

Every benchmark that matters — what each one actually measures

A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.

The 2026 core triad

BenchLM and agentic leaderboards now weight these three as the canonical agentic score.

BrowseComp

OpenAI's browsing-depth benchmark with 1,266 hard research-style problems. Tests whether an agent can find correct, factually-grounded answers across the open web. Released April 2025.

SOTA: 86.9% · 1,266 tasks · Human: 80% · Leader: Claude Mythos Preview

Terminal-Bench 2.0

Autonomous multi-step shell tasks. Current SOTA: Claude Mythos at 92.1%.

SOTA: 92.1% · Leader: Claude Mythos
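As a rough illustration of how three benchmark scores could be folded into one weighted agentic score, here is a minimal sketch. The actual weights used by BenchLM or any other leaderboard are not stated on this page, so the equal weights below are an assumption, and the function name `weighted_agentic_score` is hypothetical. The inputs are the per-benchmark SOTA figures quoted on this page, which come from different models, so the result is an upper-bound illustration, not any single agent's score.

```python
from typing import Mapping


def weighted_agentic_score(
    scores: Mapping[str, float], weights: Mapping[str, float]
) -> float:
    """Weighted mean of benchmark scores, all expressed in percent."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same benchmarks")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# ASSUMPTION: equal weights. The real leaderboard weighting is not given here.
EQUAL_WEIGHTS = {"OSWorld-Verified": 1.0, "BrowseComp": 1.0, "Terminal-Bench 2.0": 1.0}

# Per-benchmark SOTA values quoted on this page (held by different models).
sota_ceiling = {"OSWorld-Verified": 73.1, "BrowseComp": 86.9, "Terminal-Bench 2.0": 92.1}

composite = weighted_agentic_score(sota_ceiling, EQUAL_WEIGHTS)
print(round(composite, 1))
```

Changing the weights shifts the ranking between agents that are strong on desktop control and agents that are strong in the terminal, which is why the weighting scheme matters as much as the raw scores.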

Enterprise workflow

Realistic knowledge-worker tasks inside simulated companies, ServiceNow, professional domains.

GDPval

OpenAI's economic-impact benchmark. Professional work tasks across 44 occupations. Main metric = blinded expert pairwise judgment of deliverables (70.8% inter-rater human agreement). Tests whether agents can do actual white-collar work.

SOTA: 47.6% · 220 tasks · Leader: GPT-5.4
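To make the pairwise metric concrete, the sketch below tallies blinded verdicts into a win rate and a win-or-tie rate. The verdict labels, the function name `pairwise_rates`, and the sample data are all hypothetical; this page does not say which of the two rates the headline GDPval number corresponds to, so both are computed.

```python
from collections import Counter
from typing import Dict, Iterable

ALLOWED = {"agent", "human", "tie"}


def pairwise_rates(verdicts: Iterable[str]) -> Dict[str, float]:
    """Turn per-task verdicts ('agent', 'human', 'tie') into summary rates."""
    counts = Counter(verdicts)
    unknown = set(counts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts supplied")
    return {
        "win": counts["agent"] / total,
        "win_or_tie": (counts["agent"] + counts["tie"]) / total,
    }


# Hypothetical sample: 10 blinded comparisons, not real GDPval data.
sample = ["agent"] * 4 + ["tie"] * 1 + ["human"] * 5
rates = pairwise_rates(sample)
print(rates)
```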

Browser-first

Web-scoped tasks. Online-Mind2Web and REAL use programmatic checkers instead of LLM judges.

WebVoyager

Standard browser-agent benchmark. 643 tasks across 15 websites (including Amazon, Booking, GitHub, Google Search, and Cambridge Dictionary). Form filling, navigation, search, shopping. Surfer 2 (H Company) holds SOTA at 97.1%.

SOTA: 97.1% · 643 tasks · Leader: Surfer 2

WebArena

First-generation web-agent benchmark. Standalone websites in a sandboxed environment (e-commerce, social, classifieds, software). Largely superseded by Online-Mind2Web for live testing.

SOTA: 71.6% · 812 tasks

Specialized

Coding, mobile, GUI-grounding, and reasoning-heavy evaluations.

SWE-Bench Verified

OpenAI-verified subset of SWE-Bench (500 manually-verified Python issues). Originally the gold standard for coding-agent evaluation, now partially gamed — succeeded by SWE-Bench Pro.

SOTA: 87.6% · 500 tasks · Leader: Claude Opus 4.7

SWE-Bench Pro

Contamination-resistant successor to SWE-Bench Verified. 731 held-out real-world GitHub issues across multiple programming languages. Private split prevents test-set leakage.

SOTA: 64.3% · 731 tasks · Leader: Claude Opus 4.7

Mind2Web

Original Mind2Web web-agent benchmark. 2,000+ tasks across 137 websites. Largely superseded by Online-Mind2Web (live evaluation) and Mind2Web 2 (deep research).

2,350 tasks

How they actually work

The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
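The modular split can be sketched as five pluggable components wired into one loop. This is a toy under stated assumptions, not Agent S2's design: the class name `ModularAgent`, the callable signatures, and the "field=value" task format are all hypothetical, and the state is a plain dict standing in for a screen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Action = Tuple[str, str]  # (field name, value): a stand-in for a click/type event


@dataclass
class ModularAgent:
    planner: Callable[[str], List[str]]         # goal -> ordered sub-steps
    grounder: Callable[[str, State], Action]    # sub-step + observation -> action
    executor: Callable[[Action, State], State]  # apply the action, return new state
    verifier: Callable[[str, State], bool]      # did the sub-step take effect?
    memory: List[Tuple[str, Action, bool]] = field(default_factory=list)

    def run(self, goal: str, state: State) -> State:
        for step in self.planner(goal):
            action = self.grounder(step, state)
            state = self.executor(action, state)
            ok = self.verifier(step, state)
            self.memory.append((step, action, ok))
            if not ok:
                break  # a fuller agent would replan here instead of stopping
        return state


# Toy components: the "goal" is a semicolon-separated list of field=value steps.
def plan(goal: str) -> List[str]:
    return [part.strip() for part in goal.split(";")]


def ground(step: str, state: State) -> Action:
    name, value = step.split("=", 1)
    return (name.strip(), value.strip())


def execute(action: Action, state: State) -> State:
    name, value = action
    return {**state, name: value}


def verify(step: str, state: State) -> bool:
    name, value = ground(step, state)
    return state.get(name) == value


agent = ModularAgent(plan, ground, execute, verify)
final_state = agent.run("name=Ada; city=London", {})
print(final_state, len(agent.memory))
```

Because each component is just a callable, a stronger grounder or a stricter verifier can be swapped in without touching the planner, which is the practical appeal of the modular layout.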

Perception

Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.

Action grounding

Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
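The difference between JSON tool-calling and code-as-action is easiest to see side by side. In the sketch below, `get_price` is a hypothetical tool and the "model output" strings are hand-written stand-ins. With JSON calls the agent needs one model turn per lookup; with code-as-action a single emitted snippet can loop over and compose tools. The trimmed `exec` namespace is for illustration only and is not a security boundary.

```python
import json


def get_price(item: str) -> float:
    """Hypothetical tool: look up a price in a toy catalog."""
    return {"apple": 0.5, "pear": 0.75, "kiwi": 0.25}[item]


TOOLS = {"get_price": get_price}


def run_json_call(message: str) -> float:
    """JSON tool-calling: one tool and one argument set per model turn."""
    call = json.loads(message)
    return TOOLS[call["tool"]](**call["args"])


def run_code_action(source: str) -> float:
    """Code-as-action: execute a model-written snippet that may compose tools.

    WARNING: restricting builtins does not make exec safe. Real systems run
    this inside a container or microVM.
    """
    namespace = {"__builtins__": {"sum": sum}, **TOOLS}
    exec(source, namespace)
    return namespace["result"]


# Three separate "turns" with JSON tool calls...
json_total = sum(
    run_json_call(json.dumps({"tool": "get_price", "args": {"item": item}}))
    for item in ["apple", "pear", "kiwi"]
)

# ...versus one emitted snippet with code-as-action.
code_total = run_code_action(
    "result = sum(get_price(i) for i in ['apple', 'pear', 'kiwi'])"
)
print(json_total, code_total)
```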

Planning

ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
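The parallelism idea behind DAG-style planning can be sketched with `asyncio`: each tool call starts as soon as its dependencies finish, so independent calls overlap. This is not LLMCompiler's implementation. The helper names `run_dag` and `fake_tool`, the task names, and the 0.1 s delays are all hypothetical, and the sketch assumes the dependency graph has no cycles (a cycle would hang).

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

ToolFn = Callable[[Dict[str, str]], Awaitable[str]]


async def run_dag(tasks: Dict[str, ToolFn], deps: Dict[str, List[str]]) -> Dict[str, str]:
    """Run every task once, starting each as soon as its dependencies finish."""
    results: Dict[str, str] = {}
    running: Dict[str, "asyncio.Task[str]"] = {}

    async def run_node(name: str) -> str:
        # Block until all upstream nodes have produced their results.
        await asyncio.gather(*(running[d] for d in deps.get(name, [])))
        results[name] = await tasks[name](results)
        return results[name]

    # All tasks are created before any of them gets to run.
    for name in tasks:
        running[name] = asyncio.create_task(run_node(name))
    await asyncio.gather(*running.values())
    return results


async def fake_tool(label: str, delay: float = 0.1) -> str:
    """Stand-in for a slow tool call such as a web search."""
    await asyncio.sleep(delay)
    return label


tasks: Dict[str, ToolFn] = {
    "search_a": lambda r: fake_tool("A"),
    "search_b": lambda r: fake_tool("B"),
    "search_c": lambda r: fake_tool("C"),
    "summarize": lambda r: fake_tool(r["search_a"] + r["search_b"] + r["search_c"]),
}
deps = {"summarize": ["search_a", "search_b", "search_c"]}

start = time.perf_counter()
out = asyncio.run(run_dag(tasks, deps))
elapsed = time.perf_counter() - start
print(out["summarize"], f"{elapsed:.2f}s")
```

Run one at a time, the four calls would take at least 0.4 s. Here the three searches overlap, so the total is roughly two delays. How much a real agent gains depends on how much of its plan is actually independent.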

Sandboxing

Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), e2b, Firecracker microVMs.

Error recovery

Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
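Screenshot diffing and retry-with-variation combine naturally: take a frame before and after an action, and if nothing changed, try a different way of performing it. The sketch below is a minimal, hypothetical version. The function names and the toy `screen` dict are assumptions, and a byte-exact digest comparison is cruder than what a production agent would use, since blinking cursors or clocks change pixels without meaning the action worked.

```python
import hashlib
from typing import Callable, Iterable, Optional


def changed(before: bytes, after: bytes) -> bool:
    """Crude screenshot diff: compare digests of the raw frames."""
    return hashlib.sha256(before).digest() != hashlib.sha256(after).digest()


def act_with_recovery(
    screenshot: Callable[[], bytes],
    variants: Iterable[Callable[[], None]],
) -> Optional[int]:
    """Try each action variant until the screen changes.

    Returns the index of the first variant that had a visible effect,
    or None if every variant left the screen untouched.
    """
    for index, attempt in enumerate(variants):
        before = screenshot()
        attempt()
        if changed(before, screenshot()):
            return index
    return None


# Toy environment: the "screen" is just a bytes payload.
screen = {"pixels": b"login-page"}


def click_stale_selector() -> None:
    pass  # simulates a selector that no longer matches anything


def click_by_visible_text() -> None:
    screen["pixels"] = b"dashboard"  # simulates a click that navigates


winner = act_with_recovery(
    lambda: screen["pixels"],
    [click_stale_selector, click_by_visible_text],
)
print(winner)
```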

UI grounding

Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
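One way to picture cascaded grounding is as coarse-to-fine search: score large regions, zoom into the best one, and repeat until the region is small enough to click. The sketch below is a generic quadrant search, not ScreenSeekeR's algorithm. The `score` callable stands in for a vision model's confidence that a region contains the target, and the target position, screen size, and function names are all hypothetical.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def quadrants(box: Box) -> List[Box]:
    """Split a box into four equal-ish sub-boxes."""
    left, top, right, bottom = box
    mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]


def cascaded_search(
    box: Box, score: Callable[[Box], float], min_size: int = 8
) -> Tuple[int, int]:
    """Repeatedly zoom into the best-scoring quadrant, then return its center."""
    while box[2] - box[0] > min_size and box[3] - box[1] > min_size:
        box = max(quadrants(box), key=score)
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


# Hypothetical oracle: 1.0 if the region contains the button, else 0.0.
TARGET = (700, 420)


def contains_target(box: Box) -> float:
    left, top, right, bottom = box
    return 1.0 if left <= TARGET[0] < right and top <= TARGET[1] < bottom else 0.0


x, y = cascaded_search((0, 0, 1024, 768), contains_target)
print(x, y)
```

With a perfect oracle the search always converges near the target. With a real model, one wrong choice at a coarse level cannot be undone by later steps, which is part of why grounding remains a leading failure mode.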

Still unsolved — the safety ceiling

  • Prompt injection: OpenAI admitted in December 2025 that it “may never be fully solved.” A joint OpenAI/Anthropic/DeepMind study found a >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
  • Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
  • Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale.
  • Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
  • Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
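The economics and speed figures above reduce to simple arithmetic, worked through below. The helper names are hypothetical, and the wall-clock estimate assumes each extraction is a single action run serially, which is a simplification rather than a measured figure.

```python
def per_unit_cost(daily_cost_usd: float, units_per_day: int) -> float:
    """Average cost of one unit of work given a daily bill."""
    return daily_cost_usd / units_per_day


def actions_per_hour(seconds_per_action: float) -> float:
    """Serial throughput for a given per-action latency."""
    return 3600 / seconds_per_action


def serial_hours(actions: int, seconds_per_action: float) -> float:
    """Wall-clock hours to run `actions` one after another."""
    return actions * seconds_per_action / 3600


# $50 to $200 per day for 10,000 extractions, as quoted above.
cost_low = per_unit_cost(50, 10_000)
cost_high = per_unit_cost(200, 10_000)
print(f"per extraction: ${cost_low:.3f} to ${cost_high:.3f}")

# 2 to 5 seconds per action, as quoted above.
print(f"throughput: {actions_per_hour(5):.0f} to {actions_per_hour(2):.0f} actions/hour")

# ASSUMPTION: one action per extraction, no parallelism.
hours_fast = serial_hours(10_000, 2)
hours_slow = serial_hours(10_000, 5)
print(f"10,000 serial actions: {hours_fast:.1f} to {hours_slow:.1f} hours")
```

Half a cent to two cents per extraction sounds small, but at the quoted latencies a single serial worker cannot finish 10,000 actions within a working day, so real deployments pay for parallel sessions on top of the LLM fees.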

Frequently asked

Q1. What is a Computer Use agent?

A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).

Q2. What is the current OSWorld SOTA in 2026?

As of April 2026, the OSWorld-Verified leaderboard is led by Holo3-35B-A3B from H Company at 80.4%. Holo3 was the first model to cleanly beat the 72.4% human-expert baseline on the verified split. Kimi K2.6 (Moonshot AI) is second at 73.1% as a general-purpose model, and Claude Sonnet 4.6 is third at 72.1% — effectively tied with the human baseline.

Q3. Is OSWorld still the right benchmark, or is it outdated?

OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and cleaned evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.0) that weighted agentic leaderboards use. That said, it's only one dimension — enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.

Q4. How is OSWorld-Verified different from the original OSWorld?

OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.

Q5. What other benchmarks exist besides OSWorld?

The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.0 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.

Q6. Can any agent beat a human on these benchmarks?

On OSWorld-Verified: yes, two models now cleanly beat the 72.4% human baseline (Holo3-35B-A3B at 80.4%, Kimi K2.6 at 73.1%). On WebVoyager: yes, Surfer 2 at 97.1% pass@1 is above expected human accuracy. On BrowseComp: yes, Claude Mythos Preview scores 86.9%, while humans with internet access score ~80%. On SWE-Bench Verified: top models pass 87%+ of real GitHub issues. On AndroidWorld: not yet, models trail the ~80% human baseline, and the current SOTA is UI-TARS-2 at 75.8%. On GDPval: no, agents lose blind expert pairwise comparisons the majority of the time.

Q7. What's the safety picture in 2026?

Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.

Q8. Which agents are open-source?

The strongest open-source options in 2026: OpenHands (All Hands AI, sandbox+coding), Browser Use (Python library driving Chromium), Magnetic-One (Microsoft Research multi-agent), UI-TARS-2 (ByteDance, 53.1% OSWorld-Verified), Kimi K2.5/K2.6 (Moonshot, 63.3% / 73.1% OSWorld-Verified), Holo3 predecessor models (H Company research releases), GUI-Owl-1.5 32B (Alibaba, 55.4% OSWorld-Verified), Playwright MCP (Microsoft), Chrome DevTools MCP (Google). For the full list, toggle 'Open-source only' on the leaderboard above.

Editor's take — April 24, 2026

2026 is the year computer use stopped being a demo and started being a line item. Winners today: H Company on OSWorld-Verified (Holo3-35B-A3B, 80.4%), Anthropic on Terminal-Bench 2.0, OpenAI on browsing (BrowseComp + Mind2Web 2), H Company's Surfer 2 on WebVoyager (97.1%), OpenHands for open-source coding, Claude Code for terminal autonomy, Microsoft Copilot Studio for enterprise distribution. Yet to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness (scaffold + sandbox + verifier + recovery) matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source stack has caught up with proprietary offerings and in some cases surpassed them: Kimi K2.6 (73.1% OSWorld-V) is the first open-source model to beat the human baseline.

Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed April 24, 2026.