What is the current SOTA on OSWorld-Verified?

Holo3-35B-A3B from H Company holds the OSWorld-Verified record at 80.4% (April 2026), against a 72.4% human-expert baseline. The original OSWorld benchmark was published April 2024; OSWorld-Verified shipped July 2025 after 300+ community-reported task bugs were fixed.

Quick Answer · Updated April 24, 2026

State of AI · 2026

As of April 2026: the current OSWorld-Verified SOTA is Holo3-35B-A3B (H Company) at 80.4% — the first model to cleanly beat the 72.4% human baseline. Claude Mythos Preview (Anthropic) leads BrowseComp at 86.9% and Terminal-Bench 2.0 at 92.1%. Surfer 2 (H Company) holds WebVoyager at 97.1%. Claude Opus 4.7 leads SWE-Bench Verified (87.6%) and SWE-Bench Pro (64.3%). Kimi K2.6 (Moonshot AI) is the strongest open-source computer-use agent at 73.1% on OSWorld-Verified. Largest AI data center in operation: Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank).

OSWorld SOTA: 80.4%

Human baseline: 72.4%

All numbers verified against primary sources. Sources: OSWorld-Verified, BrowseComp, Steel.dev, TheAgentCompany, GDPval.

Current SOTA on every major benchmark

Eleven verified benchmarks. All scores sourced from official leaderboards or maker publications. Updated April 24, 2026.

Benchmark	Tasks	Human	SOTA	Leader	Maker	Date
OSWorld-Verified	369	72.4%	80.4%	Holo3-35B-A3B	H Company	2026-04
BrowseComp	1266	~80%	86.9%	Claude Mythos Preview	Anthropic	2026-03
WebVoyager	643	—	97.1%	Surfer 2	H Company	2026-02
Terminal-Bench 2.0	120	—	92.1%	Claude Mythos Preview	Anthropic	2026-03
SWE-Bench Verified	500	—	87.6%	Claude Opus 4.7	Anthropic	2026-04
SWE-Bench Pro	731	—	64.3%	Claude Opus 4.7	Anthropic	2026-04
TheAgentCompany	175	—	30.0%	Claude Sonnet 4.6	Anthropic	2026-02
WorkArena++	682	—	42.7%	Claude Opus 4.7	Anthropic	2026-04
AndroidWorld	116	80.0%	75.8%	UI-TARS-2	ByteDance Seed	2025-10
GDPval	220	—	47.6%	GPT-5.4	OpenAI	2026-03
ScreenSpot-Pro	1581	—	85.4%	various	—	2026
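
The gap between SOTA and the human baseline can be read off the table mechanically. The following is a minimal sketch, not part of the leaderboard tooling: it hard-codes the three rows that report a human score, treats BrowseComp's "~80%" as exactly 80.0 (an assumption), and uses the hypothetical names `ROWS` and `margin_pp`.

```python
# Rows from the table above that report a human baseline: (name, human %, SOTA %).
# Assumption: BrowseComp's "~80%" human score is treated as exactly 80.0.
ROWS = [
    ("OSWorld-Verified", 72.4, 80.4),
    ("BrowseComp", 80.0, 86.9),
    ("AndroidWorld", 80.0, 75.8),
]


def margin_pp(human: float, sota: float) -> float:
    """Return SOTA minus human baseline, in percentage points (1 decimal)."""
    return round(sota - human, 1)


for name, human, sota in ROWS:
    gap = margin_pp(human, sota)
    status = "above" if gap > 0 else "below"
    print(f"{name}: {gap:+.1f} pp ({status} the human baseline)")
```

On these figures, OSWorld-Verified and BrowseComp sit above the human line, while AndroidWorld is still below it.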

Top 5 OS-level computer-use agents (2026)

Screen-level control — takes screenshots, moves mouse, types. Ranked by OSWorld-Verified performance.

Holo3-35B-A3B

by H Company

OSWorld-V SOTA 80.4%. First model past 72.4% human baseline.

Kimi K2.6

by Moonshot AI

OSWorld-V 73.1%. Strongest open-source. 1T MoE.

Claude Sonnet 4.6

by Anthropic

OSWorld-V 72.1%. General-purpose, not specialized.

Claude Computer Use

by Anthropic

Tool-calling API for OS control. Claude 4.x powered.

UI-TARS-2

by ByteDance Seed

AndroidWorld SOTA 75.8%. Cross-platform GUI agent.

Full leaderboard: /computer-use — all 55 agents with per-benchmark breakdowns.

Top 6 AI coding agents (2026)

Ranked by SWE-Bench Pro + real-world adoption. Claude Opus 4.7 holds the model-level SOTA.

Claude Code

by Anthropic · $20/mo Pro, API pay-as-you-go

Terminal-native, Opus 4.7 + Sonnet 4.6. SOTA on SWE-Bench Pro (64.3%).

Cursor Agent

by Cursor / Anysphere · $20/mo Pro

IDE-first, multi-model. Scaffold adds ~16pp over raw model.

Codex

by OpenAI · $20/mo ChatGPT Plus

GPT-5.4 powered, GitHub-native. Released Dec 2025.

Devin

by Cognition · ~$500/mo team, ACU metered

Autonomous software engineer. ACU-based pricing (~$9/hr Core).

OpenHands

by All Hands AI · Free (OSS)

Open-source, Docker-per-session. Strongest OSS coding agent.

GitHub Copilot Workspace

by Microsoft / GitHub · $19/mo Business

Multi-model, issue-to-PR loop. Enterprise SSO.

Top 6 browser agents (2026)

Scoped to the web. DOM + pixels. Faster + cheaper than OS-level but can't touch native apps.

Surfer 2

by H Company

WebVoyager 97.1% SOTA (pass@1). 100% pass@10.

ChatGPT Atlas

by OpenAI

Agent Mode built into ChatGPT. Desktop + web.

Perplexity Comet

by Perplexity

Research-first browser agent. Fast cite-everything model.

Claude for Chrome

by Anthropic

Extension-based Claude Computer Use scoped to browser.

Dia

by The Browser Company

Chat-native browser. Tab-aware agent.

Browser Use

by Browser Use (OSS)

Python library. Strongest open-source browser driver.

Biggest AI data centers (2026)

By planned capacity. GW = gigawatt; 1 GW is ~3× a typical hyperscale cluster.

Rank	Name	Operator	Capacity	Location	Status
#1	Stargate (Abilene)	OpenAI + Oracle + SoftBank	1.2 GW planned	Texas, USA	Phase 1 live 2026-Q1
#2	xAI Colossus 2	xAI	1 GW (target 2 GW)	Memphis, USA	Expanding
#3	Amazon Rainier	AWS + Anthropic	Multi-cluster, 400+ MW	Indiana, USA	Ramping
#4	Anthropic Compute	Anthropic (AWS-hosted)	Millions of Trainium2	US multi-region	Training + inference
#5	Google TPU Campus	Alphabet / Google	TPU v5/v6p	US + Europe	Operational
#6	Microsoft Copilot Fleet	Microsoft Azure	Multi-region, NVIDIA + AMD	Global	Operational

Deep-dive: /ai-data-centers — 6 lesson pages, 130-term glossary, interactive cluster simulator.

6 trends defining AI in 2026

What changed materially over the past 12 months. Each point is sourced + quantified.

Trend 1

Computer Use moved from demo to product

H Company's Holo3-35B-A3B hit 80.4% on OSWorld-Verified (April 2026), first model to cleanly beat the 72.4% human baseline. Kimi K2.6 at 73.1% is the first open-source model above the human line.

Trend 2

The benchmark landscape was rewritten mid-flight

OSWorld-Verified (July 2025) replaced the original OSWorld (April 2024). Terminal-Bench 2.0 replaced 1.0. SWE-Bench Pro joined the standard triad alongside BrowseComp. Berkeley RDI showed 8 major benchmarks could be gamed to ~98%.

Trend 3

Open-source caught up on computer use

Moonshot's Kimi K2.6 (1T MoE, open weights) edges past Claude Sonnet 4.6 on OSWorld-Verified (73.1% vs 72.1%). ByteDance's UI-TARS-2 holds AndroidWorld SOTA. Alibaba's GUI-Owl 32B released open weights.

Trend 4

Prompt injection remains unsolved

OpenAI stated publicly in December 2025 that prompt injection 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team found a >90% bypass rate on every published defense under adaptive attack.

Trend 5

Inference economics are the bottleneck

Running 10,000 Stagehand extractions per day costs $50-$200 in LLM fees, versus zero LLM fees for deterministic Playwright. Devin: ~$9/hr Core. Per-action costs compound viciously at scale.
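
To make the Stagehand figure concrete, the daily fee range can be converted to a per-extraction cost and a monthly bill. This is a rough sketch under stated assumptions: a flat 10,000 runs per day, a 30-day month, and hypothetical names (`per_run_cost`, `monthly_cost`) that do not come from any vendor's pricing API.

```python
RUNS_PER_DAY = 10_000            # volume quoted above
DAILY_FEE_USD = (50.0, 200.0)    # low and high LLM fee per day, from the text
DAYS_PER_MONTH = 30              # assumption, for a round monthly estimate


def per_run_cost(daily_fee: float, runs: int = RUNS_PER_DAY) -> float:
    """LLM fee attributable to a single extraction, in USD."""
    return daily_fee / runs


def monthly_cost(daily_fee: float, days: int = DAYS_PER_MONTH) -> float:
    """Projected monthly LLM fee at a constant daily volume, in USD."""
    return daily_fee * days


for label, fee in zip(("low", "high"), DAILY_FEE_USD):
    print(f"{label}: ${per_run_cost(fee):.3f} per run, "
          f"${monthly_cost(fee):,.0f} per month")
```

The per-run figure looks small, which is why the daily and monthly totals are the more useful numbers when comparing against a deterministic script with no LLM fees.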

Trend 6

AI data centers became geopolitical

US export controls, EU Sovereign Cloud initiatives, and UAE/Saudi Arabia compute deals with OpenAI/Anthropic reshaped where the next 1GW+ clusters will sit.

Frequently asked

Q1. What is the best computer use agent in 2026?

Holo3-35B-A3B from H Company leads OSWorld-Verified at 80.4% (April 2026) — the first model to cleanly beat the 72.4% human-expert baseline. Kimi K2.6 (Moonshot AI) is the strongest open-source option at 73.1%, and Claude Sonnet 4.6 is third at 72.1%.

Q2. What is the current OSWorld-Verified SOTA?

80.4% by Holo3-35B-A3B (H Company) as of April 2026. The original OSWorld benchmark was published April 2024; OSWorld-Verified shipped July 2025 with 300+ task bugs fixed. See /computer-use for the live leaderboard.

Q3. Which AI model is best for coding in 2026?

Claude Opus 4.7 (Anthropic) leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. Claude Code is the dominant real-world coding wrapper. Cursor Agent, Codex (GPT-5.4), Devin, and the open-source OpenHands are strong alternatives.

Q4. What are the most important AI benchmarks in 2026?

The agentic core triad: OSWorld-Verified + BrowseComp + Terminal-Bench 2.0. Enterprise workflow: TheAgentCompany, WorkArena++. Coding: SWE-Bench Pro + Verified. Mobile: AndroidWorld. Economic impact: GDPval (OpenAI). Browser: WebVoyager, Online-Mind2Web, REAL.

Q5. Is prompt injection solved?

No. OpenAI stated publicly in December 2025 that prompt injection 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team found a >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage against a browser agent.

Q6. What are the biggest AI data centers in 2026?

Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank) has Phase 1 live in Q1 2026. xAI's Colossus 2 in Memphis is at 1 GW ramping to 2 GW. Amazon Rainier (Indiana, shared with Anthropic) is 400+ MW. Google TPU campuses, Microsoft Copilot Fleet, and Anthropic's AWS-hosted compute round out the top 6.

Q7. How much does it cost to run a computer-use agent at scale?

Running 10,000 Stagehand browser extractions per day costs $50-$200 in LLM fees, versus zero for deterministic Playwright. Devin bills on ACUs at roughly $9/hour on the Core tier. Per-action latency is 2-5s, covering the vision call, reasoning, and execution.
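
The 2-5s per-action latency adds up over a multi-step task. As a back-of-the-envelope sketch, assuming actions run strictly one after another with no retries (both assumptions) and using a hypothetical helper name `task_seconds`:

```python
LATENCY_S = (2.0, 5.0)  # per-action latency range quoted above, in seconds


def task_seconds(actions: int) -> tuple[float, float]:
    """Best- and worst-case wall-clock time for a task of `actions` steps.

    Assumes sequential execution and no retries; real agents may differ.
    """
    low, high = LATENCY_S
    return actions * low, actions * high


# Hypothetical example: a 20-step task.
best, worst = task_seconds(20)
print(f"20 actions: {best:.0f}s to {worst:.0f}s")
```

The 20-step task is an illustrative number, not a measured workload; substitute the step counts observed in your own runs.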

Q8. Where does this data come from?

gentic.news runs 17+ AI agents scanning 89+ sources every 2 hours, building a living knowledge graph of 4,711 entities and 4,875 relationships. Every benchmark score is cross-checked against primary leaderboards. Our prediction scorecard (77.6% accuracy on 121 resolved) is public at /predictions.

Sources + go deeper

Computer Use

Live 2026 leaderboard

AI Data Centers

Infrastructure vertical

Benchmarks

All 19 evals

Predictions

77.6% accuracy scorecard

Intelligence

Weekly briefing

Entity graph

4,711 entities

Primary sources (verified): OSWorld-Verified, XLANG Lab (HKU), OpenAI BrowseComp, SWE-Bench, TheAgentCompany (CMU), WorkArena (ServiceNow), AndroidWorld (DeepMind), GDPval (OpenAI), BenchLM. Last updated April 24, 2026.