gentic.news — AI News Intelligence Platform
Quick Answer · Updated April 24, 2026

State of AI · 2026

As of April 2026, the OSWorld-Verified SOTA is Holo3-35B-A3B (H Company) at 80.4%, the first model to cleanly beat the 72.4% human baseline. Claude Mythos Preview (Anthropic) leads BrowseComp at 86.9% and Terminal-Bench 2.0 at 92.1%. Surfer 2 (H Company) holds WebVoyager at 97.1%. Claude Opus 4.7 leads SWE-Bench Verified (87.6%) and SWE-Bench Pro (64.3%). Kimi K2.6 (Moonshot AI) is the strongest open-source computer-use agent at 73.1% on OSWorld-Verified. Largest AI data center by planned capacity: Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank), with Phase 1 live since Q1 2026.

55 agents tracked · 19 benchmarks · 80.4% OSWorld SOTA · 72.4% human baseline · 15 open-source agents

All numbers verified against primary sources. Sources: OSWorld-Verified, BrowseComp, Steel.dev, TheAgentCompany, GDPval.

Current SOTA on every major benchmark

Eleven verified benchmarks. All scores sourced from official leaderboards or maker publications. Updated April 24, 2026.

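| Benchmark | Tasks | Human baseline | SOTA | Leader | Maker | Date |
|---|---|---|---|---|---|---|
| OSWorld-Verified | 369 | 72.4% | 80.4% | Holo3-35B-A3B | H Company | 2026-04 |
| BrowseComp | 1266 | ~80% | 86.9% | Claude Mythos Preview | Anthropic | 2026-03 |
| WebVoyager | 643 | n/a | 97.1% | Surfer 2 | H Company | 2026-02 |
| Terminal-Bench 2.0 | 120 | n/a | 92.1% | Claude Mythos Preview | Anthropic | 2026-03 |
| SWE-Bench Verified | 500 | n/a | 87.6% | Claude Opus 4.7 | Anthropic | 2026-04 |
| SWE-Bench Pro | 731 | n/a | 64.3% | Claude Opus 4.7 | Anthropic | 2026-04 |
| TheAgentCompany | 175 | n/a | 30.0% | Claude Sonnet 4.6 | Anthropic | 2026-02 |
| WorkArena++ | 682 | n/a | 42.7% | Claude Opus 4.7 | Anthropic | 2026-04 |
| AndroidWorld | 116 | 80.0% | 75.8% | UI-TARS-2 | ByteDance Seed | 2025-10 |
| GDPval | 220 | n/a | 47.6% | GPT-5.4 | OpenAI | 2026-03 |
| ScreenSpot-Pro | 1581 | n/a | 85.4% | various | n/a | 2026 |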
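One quick way to read the table is to compare SOTA with the human baseline wherever one is reported. The sketch below is a minimal Python illustration, not part of any leaderboard's methodology: it copies three rows from the table, assumes BrowseComp's "~80%" is exactly 80.0, and converts the percentage-point gap into an approximate task count. `BenchmarkRow` and `gap_vs_human` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BenchmarkRow:
    name: str
    tasks: int
    human: Optional[float]  # human baseline in %, None when the table shows none
    sota: float             # best reported score in %


# Three rows copied from the table; BrowseComp's "~80%" is assumed to be 80.0.
ROWS = [
    BenchmarkRow("OSWorld-Verified", 369, 72.4, 80.4),
    BenchmarkRow("BrowseComp", 1266, 80.0, 86.9),
    BenchmarkRow("AndroidWorld", 116, 80.0, 75.8),
]


def gap_vs_human(row: BenchmarkRow) -> Optional[Tuple[float, int]]:
    """Return (SOTA minus human in points, the same gap as an approximate task count)."""
    if row.human is None:
        return None
    points = row.sota - row.human
    return round(points, 1), round(points / 100 * row.tasks)


for row in ROWS:
    print(row.name, gap_vs_human(row))
```

On OSWorld-Verified the 8.0-point margin works out to roughly 30 of 369 tasks, while AndroidWorld's negative gap shows its SOTA is still below the human figure.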

Top 5 OS-level computer-use agents (2026)

Screen-level control: the agent takes screenshots, moves the mouse, and types. Ranked by OSWorld-Verified performance.

Full leaderboard: /computer-use — all 55 agents with per-benchmark breakdowns.
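The screenshot, decide, act cycle that defines screen-level control can be sketched as a simple loop. This is a hypothetical Python illustration under stated assumptions: `FakeScreen` stands in for the operating system, `scripted_policy` stands in for the vision-language model call, and the `Action` type is invented for the example; real agents expose their own APIs and action spaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Literal


@dataclass
class Action:
    # Hypothetical action space; real agents map actions onto OS mouse/keyboard input.
    kind: Literal["click", "type", "done"]
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Stand-in for the OS: records actions instead of moving a real mouse."""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real agent would capture pixels here; this returns a placeholder frame.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        if action.kind == "click":
            self.log.append(f"click({action.x},{action.y})")
        elif action.kind == "type":
            self.log.append(f"type({action.text!r})")


Policy = Callable[[bytes, int], Action]


def run_agent(screen: FakeScreen, policy: Policy, max_steps: int = 10) -> int:
    """Screenshot -> decide -> act until the policy returns 'done' or the budget runs out."""
    for step in range(max_steps):
        frame = screen.screenshot()
        action = policy(frame, step)
        if action.kind == "done":
            return step
        screen.execute(action)
    return max_steps


def scripted_policy(frame: bytes, step: int) -> Action:
    # Placeholder for the model call: a fixed plan instead of reading the frame.
    plan = [Action("click", 120, 40), Action("type", text="hello"), Action("done")]
    return plan[min(step, len(plan) - 1)]


screen = FakeScreen()
steps = run_agent(screen, scripted_policy)
print(steps, screen.log)
```

The step budget (`max_steps`) is a design choice: it bounds cost and stops runaway loops when the policy never emits `done`.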

Top 6 AI coding agents (2026)

Ranked by SWE-Bench Pro score and real-world adoption. Claude Opus 4.7 holds the model-level SOTA.

Top 6 browser agents (2026)

Scoped to the web: these agents read both the DOM and rendered pixels. They are faster and cheaper than OS-level agents but cannot control native apps.

Biggest AI data centers (2026)

By planned capacity. GW = gigawatt; 1 GW is ~3× a typical hyperscale cluster.

| Rank | Name | Operator | Capacity | Location | Status |
|---|---|---|---|---|---|
| #1 | Stargate (Abilene) | OpenAI + Oracle + SoftBank | 1.2 GW planned | Texas, USA | Phase 1 live 2026-Q1 |
| #2 | xAI Colossus 2 | xAI | 1 GW (target 2 GW) | Memphis, USA | Expanding |
| #3 | Amazon Rainier | AWS + Anthropic | Multi-cluster, 400+ MW | Indiana, USA | Ramping |
| #4 | Anthropic Compute | Anthropic (AWS-hosted) | Millions of Trainium2 | US multi-region | Training + inference |
| #5 | Google TPU Campus | Alphabet / Google | TPU v5/v6p | US + Europe | Operational |
| #6 | Microsoft Copilot Fleet | Microsoft Azure | Multi-region, NVIDIA + AMD | Global | Operational |
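The capacity column mixes GW, MW, and chip counts. A minimal Python sketch can normalize the three entries that state a power figure, assuming the "1 GW is ~3× a typical hyperscale cluster" note means a typical cluster of roughly 333 MW. `to_mw` and `TYPICAL_CLUSTER_MW` are hypothetical helpers, and "400+ MW" is treated as exactly 400 MW.

```python
import re

# Assumption drawn from the note above the table: 1 GW is ~3x a typical
# hyperscale cluster, so a typical cluster is taken as roughly 333 MW.
TYPICAL_CLUSTER_MW = 1000 / 3


def to_mw(capacity: str) -> float:
    """Parse the first '<number> GW' or '<number>[+] MW' figure in a capacity string."""
    match = re.search(r"(\d+(?:\.\d+)?)\+?\s*(GW|MW)", capacity)
    if match is None:
        raise ValueError(f"no GW/MW figure in {capacity!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * 1000 if unit == "GW" else value


# Only the rows whose capacity column states a power figure.
SITES = {
    "Stargate (Abilene)": "1.2 GW planned",
    "xAI Colossus 2": "1 GW (target 2 GW)",
    "Amazon Rainier": "Multi-cluster, 400+ MW",
}

for name, capacity in SITES.items():
    mw = to_mw(capacity)
    print(f"{name}: {mw:.0f} MW, about {mw / TYPICAL_CLUSTER_MW:.1f}x a typical cluster")
```

Under that assumption, Stargate (Abilene), xAI Colossus 2, and Amazon Rainier come out at about 3.6×, 3.0×, and 1.2× a typical cluster; the rows given in chips or regions have no power figure to convert.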

Deep-dive: /ai-data-centers — 6 lesson pages, 130-term glossary, interactive cluster simulator.

Frequently asked

Q1. What is the best computer-use agent in 2026?

Holo3-35B-A3B from H Company leads OSWorld-Verified at 80.4% (April 2026) — the first model to cleanly beat the 72.4% human-expert baseline. Kimi K2.6 (Moonshot AI) is the strongest open-source option at 73.1%, and Claude Sonnet 4.6 is third at 72.1%.

Q2. What is the current OSWorld-Verified SOTA?

80.4% by Holo3-35B-A3B (H Company) as of April 2026. The original OSWorld benchmark was published in April 2024; OSWorld-Verified shipped in July 2025 with fixes for more than 300 task bugs. See /computer-use for the live leaderboard.

Q3. Which AI model is best for coding in 2026?

Claude Opus 4.7 (Anthropic) leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. Claude Code is the dominant real-world coding wrapper. Cursor Agent, Codex (GPT-5.4), Devin, and the open-source OpenHands are strong alternatives.

Q4. What are the most important AI benchmarks in 2026?

The agentic core triad: OSWorld-Verified + BrowseComp + Terminal-Bench 2.0. Enterprise workflow: TheAgentCompany, WorkArena++. Coding: SWE-Bench Pro + Verified. Mobile: AndroidWorld. Economic impact: GDPval (OpenAI). Browser: WebVoyager, Online-Mind2Web, REAL.

Q5. Is prompt injection solved?

No. OpenAI stated publicly in December 2025 that prompt injection "may never be fully solved." A joint OpenAI/Anthropic/DeepMind red-team study found bypass rates above 90% against every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click remote code execution (RCE) against a browser agent via a malicious webpage.

Q6. What are the biggest AI data centers in 2026?

Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank) has Phase 1 live in Q1 2026. xAI's Colossus 2 in Memphis is at 1 GW ramping to 2 GW. Amazon Rainier (Indiana, shared with Anthropic) is 400+ MW. Google TPU campuses, Microsoft Copilot Fleet, and Anthropic's AWS-hosted compute round out the top 6.

Q7. How much does it cost to run a computer-use agent at scale?

Running 10,000 Stagehand browser extractions per day costs $50-$200/day in LLM fees, versus zero LLM fees for a deterministic Playwright script. Devin bills in ACUs at roughly $9/hour on the Core tier. Per-action latency is 2-5 s, covering the vision call, reasoning, and execution.
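These figures can be turned into a back-of-the-envelope estimator. A minimal Python sketch, taking the $50-$200/day, $9/hour, and 2-5 s numbers at face value; the helper names, the 30-day month, the 30-action task, and the 8-hour session in the usage lines are hypothetical inputs, not measured data.

```python
from typing import Tuple

# Inputs taken from the answer above; everything else is a hypothetical usage value.
STAGEHAND_DAILY_FEES_USD = (50.0, 200.0)  # LLM fees per day, low/high estimate
EXTRACTIONS_PER_DAY = 10_000
DEVIN_USD_PER_HOUR = 9.0                  # approximate ACU-based rate, Core tier
LATENCY_PER_ACTION_S = (2.0, 5.0)         # vision call + reasoning + execution


def per_extraction_cost() -> Tuple[float, float]:
    """LLM fee per single extraction, low/high."""
    low, high = STAGEHAND_DAILY_FEES_USD
    return low / EXTRACTIONS_PER_DAY, high / EXTRACTIONS_PER_DAY


def monthly_llm_fees(days: int = 30) -> Tuple[float, float]:
    """Daily fee range scaled to a month of continuous use."""
    low, high = STAGEHAND_DAILY_FEES_USD
    return low * days, high * days


def task_wall_clock_s(actions: int) -> Tuple[float, float]:
    """Sequential agent: every action pays the full per-action latency."""
    low, high = LATENCY_PER_ACTION_S
    return actions * low, actions * high


def devin_cost(hours: float) -> float:
    """Approximate Devin spend for a session of the given length."""
    return hours * DEVIN_USD_PER_HOUR


print(per_extraction_cost())   # (0.005, 0.02)
print(monthly_llm_fees())      # (1500.0, 6000.0)
print(task_wall_clock_s(30))   # (60.0, 150.0)
print(devin_cost(8))           # 72.0
```

At these rates each extraction costs half a cent to two cents, and a 30-action task spends 1 to 2.5 minutes on latency alone.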

Q8. Where does this data come from?

gentic.news runs 17+ AI agents scanning 89+ sources every 2 hours, building a living knowledge graph of 4,711 entities and 4,875 relationships. Every benchmark score is cross-checked against primary leaderboards. Our prediction scorecard (77.6% accuracy on 121 resolved predictions) is public at /predictions.

Sources + go deeper

Primary sources (verified): OSWorld-Verified, XLANG Lab (HKU), OpenAI BrowseComp, SWE-Bench, TheAgentCompany (CMU), WorkArena (ServiceNow), AndroidWorld (DeepMind), GDPval (OpenAI), BenchLM. Last updated April 24, 2026.