State of AI · 2026
As of April 2026, the OSWorld-Verified SOTA is Holo3-35B-A3B (H Company) at 80.4%, the first model to cleanly beat the 72.4% human baseline. Claude Mythos Preview (Anthropic) leads BrowseComp at 86.9% and Terminal-Bench 2.0 at 92.1%. Surfer 2 (H Company) holds WebVoyager at 97.1%. Claude Opus 4.7 leads SWE-Bench Verified (87.6%) and SWE-Bench Pro (64.3%). Kimi K2.6 (Moonshot AI) is the strongest open-source computer-use agent at 73.1% on OSWorld-Verified. Largest AI data center by planned capacity: Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank), with Phase 1 live as of Q1 2026.
All numbers verified against primary sources. Sources: OSWorld-Verified, BrowseComp, Steel.dev, TheAgentCompany, GDPval.
Current SOTA on every major benchmark
Eleven verified benchmarks. All scores sourced from official leaderboards or maker publications. Updated April 24, 2026.
| Benchmark | Tasks | Human | SOTA | Leader | Maker | Date |
|---|---|---|---|---|---|---|
| OSWorld-Verified | 369 | 72.4% | 80.4% | Holo3-35B-A3B | H Company | 2026-04 |
| BrowseComp | 1266 | ~80% | 86.9% | Claude Mythos Preview | Anthropic | 2026-03 |
| WebVoyager | 643 | — | 97.1% | Surfer 2 | H Company | 2026-02 |
| Terminal-Bench 2.0 | 120 | — | 92.1% | Claude Mythos Preview | Anthropic | 2026-03 |
| SWE-Bench Verified | 500 | — | 87.6% | Claude Opus 4.7 | Anthropic | 2026-04 |
| SWE-Bench Pro | 731 | — | 64.3% | Claude Opus 4.7 | Anthropic | 2026-04 |
| TheAgentCompany | 175 | — | 30.0% | Claude Sonnet 4.6 | Anthropic | 2026-02 |
| WorkArena++ | 682 | — | 42.7% | Claude Opus 4.7 | Anthropic | 2026-04 |
| AndroidWorld | 116 | 80.0% | 75.8% | UI-TARS-2 | ByteDance Seed | 2025-10 |
| GDPval | 220 | — | 47.6% | GPT-5.4 | OpenAI | 2026-03 |
| ScreenSpot-Pro | 1581 | — | 85.4% | various | — | 2026 |
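For the three rows that report a human baseline, the gap between SOTA and human can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming the scores are copied by hand from the table above. The names `ROWS` and `gap_pp` are illustrative, and BrowseComp's baseline is listed only as "~80%", so its gap is approximate.

```python
# Scores copied by hand from the table above: (benchmark, human %, SOTA %).
# BrowseComp's human baseline is listed as "~80%", so 80.0 is an approximation.
ROWS = [
    ("OSWorld-Verified", 72.4, 80.4),
    ("BrowseComp", 80.0, 86.9),
    ("AndroidWorld", 80.0, 75.8),
]


def gap_pp(human: float, sota: float) -> float:
    """Return SOTA minus human baseline in percentage points, rounded to 0.1."""
    return round(sota - human, 1)


gaps = {name: gap_pp(human, sota) for name, human, sota in ROWS}

for name, gap in gaps.items():
    status = "above" if gap > 0 else "below"
    print(f"{name}: {gap:+.1f} pp ({status} human baseline)")
```

With these inputs, OSWorld-Verified comes out 8.0 pp above the human baseline and AndroidWorld 4.2 pp below it. AndroidWorld is the only one of the three where SOTA still trails humans.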
Top 5 OS-level computer-use agents (2026)
Screen-level control: the agent takes screenshots, moves the mouse, and types. Ranked by OSWorld-Verified score where one is reported.
Holo3-35B-A3B
by H Company · OSWorld-V SOTA 80.4%. First model past 72.4% human baseline.
Kimi K2.6
by Moonshot AI · OSWorld-V 73.1%. Strongest open-source. 1T MoE.
Claude Sonnet 4.6
by Anthropic · OSWorld-V 72.1%. General-purpose, not specialized.
Claude Computer Use
by Anthropic · Tool-calling API for OS control. Claude 4.x powered.
UI-TARS-2
by ByteDance Seed · AndroidWorld SOTA 75.8%. Cross-platform GUI agent.
Full leaderboard: /computer-use — all 55 agents with per-benchmark breakdowns.
Top 6 AI coding agents (2026)
Ranked by SWE-Bench Pro + real-world adoption. Claude Opus 4.7 holds the model-level SOTA.
Claude Code
by Anthropic · $20/mo Pro, API pay-as-you-go · Terminal-native, Opus 4.7 + Sonnet 4.6. SOTA on SWE-Bench Pro (64.3%).
Cursor Agent
by Cursor / Anysphere · $20/mo Pro · IDE-first, multi-model. Scaffold adds ~16pp over raw model.
Codex
by OpenAI · $20/mo ChatGPT Plus · GPT-5.4 powered, GitHub-native. Released Dec 2025.
Devin
by Cognition · ~$500/mo team, ACU metered · Autonomous software engineer. ACU-based pricing (~$9/hr Core).
OpenHands
by All Hands AI · Free (OSS) · Open-source, Docker-per-session. Strongest OSS coding agent.
GitHub Copilot Workspace
by Microsoft / GitHub · $19/mo Business · Multi-model, issue-to-PR loop. Enterprise SSO.
Top 6 browser agents (2026)
Scoped to the web. DOM + pixels. Faster + cheaper than OS-level but can't touch native apps.
Surfer 2
by H Company · WebVoyager 97.1% SOTA (pass@1). 100% pass@10.
ChatGPT Atlas
by OpenAI · Agent Mode built into ChatGPT. Desktop + web.
Perplexity Comet
by Perplexity · Research-first browser agent. Fast cite-everything model.
Claude for Chrome
by Anthropic · Extension-based Claude Computer Use scoped to browser.
Dia
by The Browser Company · Chat-native browser. Tab-aware agent.
Browser Use
by Browser Use (OSS) · Python library. Strongest open-source browser driver.
Biggest AI data centers (2026)
By planned capacity. GW = gigawatt; 1 GW is ~3× a typical hyperscale cluster.
| Rank | Name | Operator | Capacity | Location | Status |
|---|---|---|---|---|---|
| #1 | Stargate (Abilene) | OpenAI + Oracle + SoftBank | 1.2 GW planned | Texas, USA | Phase 1 live 2026-Q1 |
| #2 | xAI Colossus 2 | xAI | 1 GW (target 2 GW) | Memphis, USA | Expanding |
| #3 | Amazon Rainier | AWS + Anthropic | Multi-cluster, 400+ MW | Indiana, USA | Ramping |
| #4 | Anthropic Compute | Anthropic (AWS-hosted) | Millions of Trainium2 | US multi-region | Training + inference |
| #5 | Google TPU Campus | Alphabet / Google | TPU v5/v6p | US + Europe | Operational |
| #6 | Microsoft Copilot Fleet | Microsoft Azure | Multi-region, NVIDIA + AMD | Global | Operational |
Deep-dive: /ai-data-centers — 6 lesson pages, 130-term glossary, interactive cluster simulator.
6 trends defining AI in 2026
What changed materially over the past 12 months. Each point is sourced + quantified.
Computer Use moved from demo to product
H Company's Holo3-35B-A3B hit 80.4% on OSWorld-Verified in April 2026, making it the first model to cleanly beat the 72.4% human baseline. Kimi K2.6, at 73.1%, is the first open-source model above the human line.
The benchmark landscape was rewritten mid-flight
OSWorld-Verified (July 2025) replaced the original OSWorld (April 2024). Terminal-Bench 2.0 replaced 1.0. SWE-Bench Pro joined the standard triad alongside BrowseComp. Berkeley RDI showed 8 major benchmarks could be gamed to ~98%.
Open-source caught up on computer use
Moonshot's Kimi K2.6 (1T MoE, open weights) edges out Claude Sonnet 4.6 on OSWorld-Verified (73.1% vs. 72.1%). ByteDance's UI-TARS-2 holds AndroidWorld SOTA. Alibaba released open weights for GUI-Owl 32B.
Prompt injection remains unsolved
OpenAI stated publicly in December 2025 that prompt injection 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team found a >90% bypass rate on every published defense under adaptive attack.
Inference economics are the bottleneck
Stagehand extractions: $50-$200/day per 10,000 runs in LLM fees vs zero for deterministic Playwright. Devin: ~$9/hr Core. Per-action costs compound viciously at scale.
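The per-run cost implied by those figures can be worked out directly. The following is a minimal sketch, assuming the $50-$200/day range and the 10,000 runs/day volume quoted above. The 30-day month and all names (`cost_per_run`, `DAYS_PER_MONTH`, and so on) are illustrative assumptions, not figures from the source.

```python
# Figures quoted above: $50-$200/day in LLM fees for 10,000 runs per day.
RUNS_PER_DAY = 10_000
DAILY_FEES_USD = (50.0, 200.0)  # (low, high) end of the quoted range

# Assumption for illustration only: a 30-day billing month.
DAYS_PER_MONTH = 30


def cost_per_run(daily_fees_usd: float, runs_per_day: int) -> float:
    """Average LLM fee per run, in USD."""
    return daily_fees_usd / runs_per_day


per_run = tuple(cost_per_run(fees, RUNS_PER_DAY) for fees in DAILY_FEES_USD)
per_month = tuple(fees * DAYS_PER_MONTH for fees in DAILY_FEES_USD)

print(f"Per run:   ${per_run[0]:.3f} to ${per_run[1]:.3f}")
print(f"Per month: ${per_month[0]:,.0f} to ${per_month[1]:,.0f}")
```

Under these assumptions the range works out to roughly half a cent to two cents per extraction, or $1,500 to $6,000 over a 30-day month at the same volume. Each run is cheap on its own, but the total is not.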
AI data centers became geopolitical
US export controls, EU Sovereign Cloud initiatives, and UAE/Saudi Arabia compute deals with OpenAI/Anthropic reshaped where the next 1 GW+ clusters will sit.
Frequently asked
Q1. What is the best computer-use agent in 2026?
Holo3-35B-A3B from H Company leads OSWorld-Verified at 80.4% (April 2026) — the first model to cleanly beat the 72.4% human-expert baseline. Kimi K2.6 (Moonshot AI) is the strongest open-source option at 73.1%, and Claude Sonnet 4.6 is third at 72.1%.
Q2. What is the current OSWorld-Verified SOTA?
80.4% by Holo3-35B-A3B (H Company) as of April 2026. The original OSWorld benchmark was published April 2024; OSWorld-Verified shipped July 2025 with 300+ task bugs fixed. See /computer-use for the live leaderboard.
Q3. Which AI model is best for coding in 2026?
Claude Opus 4.7 (Anthropic) leads SWE-Bench Verified at 87.6% and SWE-Bench Pro at 64.3%. Claude Code is the dominant real-world coding wrapper. Cursor Agent, Codex (GPT-5.4), Devin, and the open-source OpenHands are strong alternatives.
Q4. What are the most important AI benchmarks in 2026?
The agentic core triad: OSWorld-Verified + BrowseComp + Terminal-Bench 2.0. Enterprise workflow: TheAgentCompany, WorkArena++. Coding: SWE-Bench Pro + Verified. Mobile: AndroidWorld. Economic impact: GDPval (OpenAI). Browser: WebVoyager, Online-Mind2Web, REAL.
Q5. Is prompt injection solved?
No. OpenAI stated publicly in December 2025 that prompt injection 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team found a >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage against a browser agent.
Q6. What are the biggest AI data centers in 2026?
Stargate Abilene (Texas, 1.2 GW planned, OpenAI + Oracle + SoftBank) has Phase 1 live in Q1 2026. xAI's Colossus 2 in Memphis is at 1 GW ramping to 2 GW. Amazon Rainier (Indiana, shared with Anthropic) is 400+ MW. Google TPU campuses, Microsoft Copilot Fleet, and Anthropic's AWS-hosted compute round out the top 6.
Q7. How much does it cost to run a computer-use agent at scale?
Running 10,000 Stagehand browser extractions per day costs $50-$200/day in LLM fees, versus zero for deterministic Playwright. Devin bills on ACUs at roughly $9/hour on the Core tier. Per-action latency is 2-5 s across the vision call, reasoning, and execution.
Q8. Where does this data come from?
gentic.news runs 17+ AI agents scanning 89+ sources every 2 hours, building a living knowledge graph of 4,711 entities and 4,875 relationships. Every benchmark score is cross-checked against primary leaderboards. Our prediction scorecard (77.6% accuracy on 121 resolved) is public at /predictions.
Sources + go deeper
Computer Use: Live 2026 leaderboard
AI Data Centers: Infrastructure vertical
Benchmarks: All 19 evals
Predictions: 77.6% accuracy scorecard
Intelligence: Weekly briefing
Entity graph: 4,711 entities
Primary sources (verified): OSWorld-Verified (XLANG Lab, HKU), OpenAI BrowseComp, SWE-Bench, TheAgentCompany (CMU), WorkArena (ServiceNow), AndroidWorld (DeepMind), GDPval (OpenAI), BenchLM. Last updated April 24, 2026.