gentic.news — AI News Intelligence Platform

Computer Use · 2026

Updated April 23, 2026

Agents that can actually drive your computer.

Two years ago no AI could operate a desktop better than a drunk intern — OSWorld scores sat under 15%. Today H Company's Holo3 leads OSWorld-Verified at 80.4%, clearing the 72.4% human-expert baseline. Two models now beat humans on OSWorld, three on WebVoyager, and a dozen on SWE-Bench Verified. The big story isn't one breakthrough — it's everyone shipping at once, and the benchmarks themselves getting rewritten mid-flight.

This page tracks 55 agents across 4 architectural categories, scored on 19 benchmarks. Scores come from the official OSWorld-Verified leaderboard, BrowseComp, Steel.dev, maker publications, and independent verification.

55 agents tracked · 19 benchmarks covered · 15 open-source options · OSWorld-V SOTA 80.4% (vs 72.4% human)

The 4 types of computer use


Every Computer Use agent, by type

55 tracked agents across 4 types, ranked within each category by peak benchmark score.

Screen-level OS control

15 agents
Agent · Maker · Launch · OSWorld-V · Pricing

Claude Sonnet 4.6 · Anthropic · 2026-02 · 72.1 · API $3/$15 per M tokens
Multimodal large language model from Anthropic, released February 25, 2026.

Kimi K2.5 · Moonshot AI · 2026-01 · 63.3 · API pay-as-you-go
Open-source multimodal model: 1 trillion parameters, vision capabilities, and Agent Swarm technology.

Claude Computer Use · Anthropic · 2024-10 · 72.1 · Claude API, input $5/M, output $25/M
The first public screen-level OS control API. Powered by Claude 3.5 → 4.x → Opus 4.7. Takes screenshots, identifies UI elements, and emits raw click and keystroke actions.

Kimi K2.6 · Moonshot AI · 2026-04 · 73.1 · API $0.60/$2.75 per M tokens
Moonshot AI's 1T-param MoE (32B active) built for long-horizon agentic coding (up to 13 h continuous), with agent swarms scaling to 300 sub-agents.

Holo3-35B-A3B · H Company · 2026-04 · 80.4 · H Company enterprise
H Company's specialized OSWorld agent, currently #1 on OSWorld-Verified at 80.4% (2 runs, 100 steps). First model to explicitly beat the 72.4% human baseline.

Claude Sonnet 4.5 · Anthropic · 2025-09 · 62.9 · Legacy Anthropic API
September 2025 release; a benchmark milestone that showed rapid improvement over prior Claude generations.

Seed 1.8 · ByteDance Seed · 2025-12 · 61.9 · Doubao ecosystem
ByteDance Seed's general model. ByteDance has invested heavily in computer use via the UI-TARS + Seed lineage.

LongCat CUA · Meituan LongCat · 2026-01 · 56.7 · Research
Meituan's LongCat Team specialized CUA research model. 56.7% at max 50 steps: strong for a step-constrained evaluation.

GUI-Owl-1.5 32B · Alibaba Tongyi Lab · 2026-03 · 55.4 · Free (OSS)
Alibaba Tongyi Lab's Mobile-Agent Team specialized model. 55.4% at max 50 steps. Open weights, Chinese research group.

DeepMiner-Mano · Mininglamp Technology · 2025-10 · 53.9 · Research
Mininglamp's specialized 72B GUI agent. 53.9% with max 100 steps.

UI-TARS-2 · ByteDance Seed · 2025-10 · 53.1 · ByteDance ecosystem
ByteDance Seed's second-generation UI-TARS specialized GUI agent. 53.1%, max 100 steps.

OpenCUA · HKU & Moonshot AI · 2025-10 · 45.0 · Free (OSS)
University of Hong Kong × Moonshot AI joint open-source CUA model. 45.0% (3-run average, 100 steps). Most-cited open-weights CUA.

OpenAI CUA / Operator · OpenAI · 2025-01 · 31.3 · ChatGPT Pro
OpenAI's Operator/CUA specialized model as benchmarked at 31.3% (max 50 steps). Now folded into ChatGPT Agent.

Claude Computer Use (preview) · Anthropic · 2024-10 · 31.3 · Anthropic API
Anthropic's original computer-use preview from October 2024, scored at 31.3% with max 50 steps. Proved the paradigm; since superseded.

Copilot Studio computer use · Microsoft · 2026-04 · — · Copilot Studio subscription
Enterprise computer-using agents, GA April 2026. Choice of Claude Sonnet 4.5 or OpenAI CUA backends. Built-in credentials vault and Purview auditing.

Browser-only

19 agents
Agent · Maker · Launch · WebVoyager · Pricing

Surfer 2 · H Company · 2026-02 · 97.1 · Enterprise contract
French startup H Company's proprietary enterprise agent. Current WebVoyager SOTA at 97.1%. Top of the Steel.dev public leaderboard.

Browser Use · Browser Use (OSS) · 2024-10 · 89.1 · Free (OSS)
Open-source Python library (50k+ GitHub stars). Multi-tab, memory, parallel agents, BYO-LLM. 89.1% on WebVoyager by independent eval.

Magnitude · Magnitude · 2025-06 · 93.9 · Contact sales
Independent browser agent focused on test automation. #2 on WebVoyager at 93.9%, behind only Surfer 2.

ChatGPT Agent · OpenAI · 2026-01 · 87.0 · ChatGPT Pro $200/mo; Plus $20/mo (waitlisted)
OpenAI's unified agent (absorbed Operator January 2026). Runs Chrome via a vision-action loop with GPT-5-class reasoning.

Skyvern · Skyvern (Y Combinator) · 2024-04 · 85.8 · Open-source; enterprise contract
Open-source browser agent. 85.85% on WebVoyager, best-in-class on 'WRITE' form-filling tasks. Workflow chaining. Strong RPA-replacement positioning.

Stagehand · Browserbase · 2024-06 · 85.8 · Free SDK + Browserbase cloud ($99/mo+)
MIT-licensed SDK. V3 rewrite (Feb 2026) uses the Chrome DevTools Protocol directly, 44% faster than v2. AI-native caching + self-healing DOM.

Project Mariner · Google DeepMind · 2024-12 · 83.5 · Google AI Ultra subscription
Chrome-integrated agent powered by Gemini 2.0 → 3.x. Runs 10 concurrent VM tasks. Available via the Google AI Ultra subscription.

Playwright MCP · Microsoft · 2025-03 · — · Free (Apache-2.0)
Microsoft's official MCP server wrapping Playwright. Exposes Chromium/Firefox/WebKit as MCP tools so any MCP client (Claude Code, Cursor, and others) can drive a browser.

MultiOn · Multi-On · 2023-11 · — · Subscription
Once a top-3 browser agent. Notably absent from 2026 leaderboards; overtaken by Surfer 2, Magnitude, and AIME.

Chrome DevTools MCP · Google · 2025-09 · — · Free (Apache-2.0)
Google Chrome team's official MCP server exposing the DevTools Protocol to AI agents. Used by Claude Code + Cursor for browser automation and performance work.

Browser MCP · hangwin · 2025-06 · — · Free (MIT)
Popular community MCP server that connects any MCP-compatible AI agent to a live Chrome browser via an extension + local bridge. Lightweight alternative to Playwright MCP.

Claude for Chrome · Anthropic · 2025-08 · — · Claude Pro ($20) / Max ($100+)
Anthropic's Chrome extension. Research preview August 2025, expanded to Max users November 2025, all paid users December 2025. Reads and acts on pages in the user's browser.

ChatGPT Atlas · OpenAI · 2025-10 · — · Free sidebar; Agent Mode via ChatGPT Plus/Pro
OpenAI's Chromium-based AI-native browser. macOS at launch, rolling out to Windows/iOS/Android. Persistent ChatGPT sidebar + Agent Mode for autonomous browsing.

Comet · Perplexity · 2025-07 · — · Free · Comet Plus $5 · Pro $20 · Max $200
AI-native browser from Perplexity. Started as Max-only ($200/mo), went free globally October 2, 2025. The assistant can execute multi-step workflows.

Dia · The Browser Company · 2025-06 · — · Free · Dia Pro $20/mo
Browser Company's AI-first browser, successor to Arc (sunset May 2025). Acquired by Atlassian in 2025. Persistent AI command bar + skills.

Fellou · Fellou Inc · 2025-09 · — · Freemium + paid
Silicon Valley agentic browser. Fellou CE launched September 2025 as the 'world's first spatial agentic browser'. Built-in Research and Shopping workflows, among others.

Opera Neon · Opera · 2025-12 · — · $19.90/mo
Opera's new agentic browser (the 2025 product, not to be confused with the 2017 Neon concept). Invite-only September 2025, public December 2025. Subscription model.

Leo · Brave · 2023-11 · — · Free · Leo Premium $15/mo
Brave's built-in AI. Agentic 'AI Browsing' mode shipped December 2025 (Nightly 1.86+). Privacy-first: no account required on the free tier. Chromium-based.

Browserbase · Browserbase · 2024-06 · — · Usage-based, $99/mo+
Managed headless-browser infrastructure, the 'AWS for headless browsers'. $40M Series B June 2025 at a $300M valuation. Residential IPs + CAPTCHA handling.

Sandboxed VM / container

16 agents
Agent · Maker · Launch · Best score · Pricing

Action Agent · Writer · 2025-09 · GAIA 61.0 · Enterprise contract
Enterprise agent from Writer (content platform). GAIA Level 3 leader at 61%; surpassed Manus mid-2025.

Manus · Butterfly Effect · 2025-03 · GAIA 57.7 · Subscription (invite-only at launch)
Chinese multi-agent system (Executor + Planner + Knowledge) with 29 tools and a per-session isolated Linux sandbox. Desktop launched March 2026 with direct OS access.

OpenHands · All-Hands AI · 2024-03 · SWE-bench Verified 53.0 · Free (self-hosted) or All-Hands Cloud
MIT-licensed open-source agent (formerly OpenDevin). Docker-based sandbox, BYO-LLM. 53%+ on SWE-bench Verified with Claude. Consistently top-3 among open agents.

Lovable · Lovable · 2024-11 · — · Free + Pro $25/mo
AI full-stack app builder (formerly GPT Engineer). Went viral in 2025. Supabase integration out of the box. Built for non-technical founders.

Replit Agent · Replit · 2026-03 · — · Core $25/mo + credits; Pro $100/mo + credits
Cloud development agent with parallel tasks, branching, and sub-agent spawning. Effort-based pricing. Mix of Claude/GPT/Gemini backends.

E2B · E2B · 2023-11 · — · Free tier + usage-based
Open-source secure cloud runtime for AI agents. Firecracker microVMs. Python/JS SDKs + custom templates. Used as the sandbox backend by many agent stacks.

Daytona · Daytona · 2024-05 · — · Free (OSS) + cloud paid
Open-source ephemeral dev environments, pivoted to an AI-agent runtime in 2025. Self-hostable. Used for per-session Linux sandboxes.

Modal · Modal Labs · 2024-09 · — · $0.00003942/CPU-sec
Serverless sandbox runtime for agent code execution. Sub-second cold starts. Popular as an execution backend for research agents.

Bedrock AgentCore · Amazon · 2025-10 · — · AWS usage-based
AWS's managed agent platform (GA October 2025, preview July 2025). Includes Browser tool, Code Interpreter, Gateway, and Memory. Any model via Bedrock.

Gemini Enterprise · Google · 2024-12 · — · Enterprise contract
Google's enterprise agent platform, rebranded from Agentspace at Cloud Next 2026. Managed agents + ADK (Agent Development Kit, open source).

Agentforce · Salesforce · 2024-09 · — · Per-conversation pricing
Salesforce's enterprise agent platform. v1 September 2024; Agentforce 360 for AWS early 2026. Deep CRM integration, guardrails, per-conversation pricing.

watsonx Orchestrate · IBM · 2025-05 · — · Enterprise contract
IBM's enterprise agent platform with 150+ pre-built agents in the Agent Catalog. Granite models under the hood. Targets regulated industries.

Now Assist · ServiceNow · 2023-09 · — · Enterprise contract
Enterprise AI agents baked into ServiceNow workflows. IBM Granite integration from May 2024. ITSM + HR + customer-service automation.

Bolt · StackBlitz · 2024-10 · — · Credits-based, $20/mo Pro
Browser-based AI full-stack builder. Runs npm install + code in WebContainer directly in the browser. Bolt V2 shipped 2025.

Firebase Studio · Google · 2025-04 · — · Free tier + usage
Google's browser-based AI development environment (rebrand of Project IDX). Firebase + Gemini + AI Studio integration. Free tier available.

Emergent · Emergent · 2025-01 · — · Free + paid tiers
Full-stack agent with a live preview sandbox. Built-in hosting + auth + database. Aggressive pricing vs Cursor/Replit/Bolt.

Coding-focused

5 agents
Agent · Maker · Launch · SWE-bench · Pricing

SWE-agent · Princeton + Stanford · 2024-04 · 74.0 · Free (OSS)
Open-source research agent (NeurIPS 2024). Mini-SWE-Agent scores >74% on SWE-bench Verified in 100 lines of Python, no tool-calling needed.

Devin · Cognition · 2024-03 · — · Core $20/mo + $2.25/ACU; Team $500/mo
The original 'AI software engineer'. Dropped from $500/mo to a $20/mo Core tier in April 2025. The in-house SWE-1.5 model scores 40.08% on SWE-bench Pro.

Aider · Aider (OSS) · 2023-05 · — · Free (OSS)
Terminal-first AI pair programmer. Git-integrated. Batch editing. BYO-LLM. Popular in the local-LLM community.

Cursor Agent · Anysphere · 2025-06 · — · $20/mo Pro
IDE-first agent inside Cursor. Adds 16pp over the raw model via its scaffold. Reports 70% on CursorBench with Opus 4.7. Doesn't publish SWE-bench Verified numbers.

v0 · Vercel · 2023-10 · — · Credits, $20/mo Pro
Vercel's generative UI + coding agent. Creates React/Next.js apps from natural language. v0 Agent connects to existing repos.

Scores from official publications + independent leaderboards (Steel.dev, OSWorld, SWE-Bench). Dash = not published.

How the benchmarks evolved — 2023 → 2026

The benchmark landscape changed faster than the models did. OSWorld v1 launched in April 2024; by July 2025 it had been superseded by OSWorld-Verified. The 2026 consensus settled around three benchmarks, not one.

  1. 2023 · WebArena + Mind2Web

    First generation. DOM-only. Fully gamed by 2025.

  2. Apr 2024 · OSWorld (v1)

    XLANG Lab's first real-desktop VM benchmark. 369 tasks.

  3. Jun 2024 · WebVoyager + GAIA

    Web agents + general reasoning. GPT-4V as judge (later criticized).

  4. Dec 2024 · TheAgentCompany

    CMU: simulated startup. Browse + code + Slack. First multi-surface benchmark.

  5. Apr 2025 · BrowseComp

    OpenAI: 1,266 research-depth browsing problems. Grounded in factuality.

  6. Jul 2025 · OSWorld-Verified

    XLANG Lab revised + cleaned. AWS infra, 50× parallel. Fixed 300+ task bugs.

  7. Sep 2025 · GDPval

    OpenAI: 44 occupations, blinded expert judging of deliverables. Economic impact.

  8. Oct 2025 · SWE-Bench Pro + Terminal-Bench 2.0

    Held-out, contamination-resistant successors to the gamed originals.

  9. 2026 · Core triad consolidates

    OSWorld-Verified + BrowseComp + Terminal-Bench 2.0 = weighted agentic score.

Every benchmark that matters — what each one actually measures

A 2026 Berkeley RDI study showed 8 major leaderboards could be gamed to near-100% via config leakage or DOM injection. We only trust Verified, Pro, or programmatically-checked variants with held-out test sets.

How they actually work

The winning 2026 pattern is the modular stack: Planner + Grounder + Executor + Memory + Verifier. Simular's Agent S2 showed that splitting UI reading, planning, and low-level clicking into separate modules beats any monolithic model on long tasks.
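In code, that separation is just a loop with pluggable pieces. A minimal sketch, assuming hypothetical planner/grounder/executor/verifier interfaces (illustrative names, not Agent S2's published API):

```python
# Sketch of the modular stack: Planner + Grounder + Executor + Verifier,
# with a plain list standing in for Memory. Every interface is hypothetical.

def run_task(task, planner, grounder, executor, verifier, max_steps=50):
    memory = []                                   # shared action history
    for instruction in planner(task, memory):     # Planner: task -> step list
        for attempt in range(3):                  # bounded retries per step
            if len(memory) >= max_steps:
                return False                      # step budget exhausted
            shot = executor.screenshot()
            x, y = grounder(instruction, shot)    # Grounder: text -> (x, y)
            executor.click(x, y)                  # Executor: low-level action
            ok = verifier(instruction, executor.screenshot())
            memory.append((instruction, "ok" if ok else f"retry {attempt + 1}"))
            if ok:
                break
        else:
            return False                          # step failed three times
    return True

# Toy wiring, just to show the control flow end to end.
class FakeExecutor:
    def screenshot(self): return b"png-bytes"
    def click(self, x, y): print(f"click({x}, {y})")

print(run_task(
    "buy the blue widget",
    planner=lambda task, memory: ["click the Buy button"],
    grounder=lambda instruction, shot: (412, 306),
    executor=FakeExecutor(),
    verifier=lambda instruction, shot: True,
))
```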

Perception

Vision-language models (Claude 4.x, GPT-5.4, Gemini 3) process screenshots. Hybrid DOM + pixel is now standard for browser agents.
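For a browser agent, the hybrid observation is straightforward to assemble: pixels for the vision model plus the DOM for element text. A sketch using Playwright's Python API; the observation dict layout is an assumption, not any product's format:

```python
# Hybrid DOM + pixel perception: capture both views of the same page state
# so the model can cross-reference what it sees with what the DOM says.
from playwright.sync_api import sync_playwright

def capture_observation(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        observation = {
            "screenshot_png": page.screenshot(),   # pixels for the VLM
            "dom_html": page.content(),            # DOM for element text and IDs
            "url": page.url,
            "title": page.title(),
        }
        browser.close()
    return observation

obs = capture_observation("https://example.com")
print(obs["title"], "-", len(obs["screenshot_png"]), "bytes of pixels")
```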

Action grounding

Function calls, Code-as-Action (CodeAct), constrained UI DSLs, or raw VLA. CodeAct beats JSON tool-calling on complex tasks.
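The gap is easiest to see side by side: a JSON tool call does one action per model turn, while a CodeAct turn can loop. A toy contrast with made-up tool names (list_rows, locate, click) and a deliberately naive exec-based runner:

```python
# Style 1: JSON tool-calling, one action per model turn.
json_action = {"tool": "click", "args": {"x": 412, "y": 306}}

# Style 2: CodeAct, where the model writes a short program against a tool
# namespace, so iteration happens in one turn instead of N turns.
codeact_program = """
for row in list_rows("orders"):
    if row["status"] == "unpaid":
        click(*locate(f"pay button for order {row['id']}"))
"""

def run_codeact(program: str, tools: dict) -> None:
    # Execute model-written code against a restricted namespace.
    # A production system would sandbox this far more aggressively.
    exec(program, {"__builtins__": {}}, dict(tools))

run_codeact(codeact_program, tools={
    "list_rows": lambda table: [{"id": 7, "status": "unpaid"}],
    "locate": lambda desc: (100, 200),
    "click": lambda x, y: print(f"click({x}, {y})"),
})
```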

Planning

ReAct is the backbone. LLMCompiler runs a DAG with 3.6× parallel speedup. Hierarchical decomposition + Reflexion for complex tasks.
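At its core ReAct is a transcript loop: the model alternates Thought and Action lines, the runtime appends an Observation, and everything else (LLMCompiler's DAG parallelism, Reflexion's self-critique) layers on top. A bare-bones sketch with a scripted stand-in for the model:

```python
# Minimal ReAct loop. `llm` is any callable from transcript -> reply; the
# "Action: tool[argument]" syntax is the classic ReAct convention.

def react(task: str, llm, tools: dict, max_turns: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        reply = llm(transcript)
        transcript += reply + "\n"
        if reply.startswith("Final:"):
            return reply.removeprefix("Final:").strip()
        name, _, arg = reply.split("Action: ")[-1].partition("[")
        observation = tools[name](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return "step budget exhausted"

# Scripted "model": searches once, then answers from the observation.
replies = iter([
    "Thought: I should look this up.\nAction: search[OSWorld-Verified SOTA]",
    "Final: Holo3-35B-A3B at 80.4%",
])
print(react("find the OSWorld SOTA", lambda transcript: next(replies),
            tools={"search": lambda q: "Holo3-35B-A3B leads at 80.4%"}))
```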

Sandboxing

Docker per-session (OpenHands), managed cloud browsers (Browserbase, $300M valuation), E2B, Firecracker microVMs.
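A per-session sandbox can be as simple as shelling out to Docker with networking off. The flags below are standard Docker CLI, but the specific limits are illustrative choices; real systems like OpenHands or E2B do considerably more:

```python
# Run model-written code in a throwaway, network-less container.
import subprocess

def run_in_sandbox(code: str, timeout_s: int = 30) -> str:
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # no network: blunts data exfiltration
            "--memory", "512m",       # cap RAM
            "--cpus", "1",            # cap CPU
            "--pids-limit", "128",    # cap process count (fork bombs)
            "python:3.12-slim",
            "python", "-c", code,
        ],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout if result.returncode == 0 else f"error: {result.stderr}"

print(run_in_sandbox("print(sum(range(10)))"))   # -> 45, computed in the container
```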

Error recovery

Screenshot diffing, retry-with-variation, self-healing DOM layers (Stagehand v3) that adapt to layout shifts.
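The diff-and-retry idea fits in a few lines. A sketch using byte-level diffing for brevity; production systems compare downscaled frames or DOM snapshots, not raw bytes:

```python
import random

def did_screen_change(before: bytes, after: bytes, min_changed: int = 64) -> bool:
    # Crude diff: count differing bytes between two same-size screenshots.
    return sum(a != b for a, b in zip(before, after)) >= min_changed

def click_with_recovery(executor, x: int, y: int, attempts: int = 3) -> bool:
    for _ in range(attempts):
        before = executor.screenshot()
        executor.click(x, y)
        if did_screen_change(before, executor.screenshot()):
            return True                   # the UI reacted; treat as success
        # Retry-with-variation: jitter the target a few pixels and try again.
        x += random.randint(-8, 8)
        y += random.randint(-8, 8)
    return False

# Toy executor whose screen only "changes" after the second click.
class ToyExecutor:
    clicks = 0
    def screenshot(self): return bytes([self.clicks >= 2]) * 128
    def click(self, x, y): self.clicks += 1

print(click_with_recovery(ToyExecutor(), 412, 306))   # True, on the second try
```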

UI grounding

Mapping “click the blue Buy button” to X/Y coordinates is the #1 failure mode. Cascaded search (ScreenSeekeR) pushed SOTA to 48.7%.
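Cascaded search is a coarse-to-fine trick: pick a region first, then re-query on the upscaled crop so the model localizes at a higher effective resolution. An illustration in that spirit (not ScreenSeekeR's actual algorithm; the vlm callable is a stand-in):

```python
from PIL import Image

def ground(instruction: str, screenshot: Image.Image, vlm) -> tuple[int, int]:
    w, h = screenshot.size
    # Pass 1, coarse: which quadrant holds the target? (0=TL, 1=TR, 2=BL, 3=BR)
    quadrant = vlm(f"Which quadrant holds: {instruction}?", screenshot)
    qx, qy = (quadrant % 2) * (w // 2), (quadrant // 2) * (h // 2)
    crop = screenshot.crop((qx, qy, qx + w // 2, qy + h // 2))
    # Pass 2, fine: exact point inside the 2x-upscaled crop.
    cx, cy = vlm(f"Return (x, y) of: {instruction}", crop.resize((w, h)))
    # Map crop coordinates back to full-screen coordinates.
    return qx + cx // 2, qy + cy // 2

# Scripted "VLM": answers quadrant 3, then a point inside the upscaled crop.
answers = iter([3, (400, 300)])
point = ground("the blue Buy button", Image.new("RGB", (1920, 1080)),
               lambda prompt, image: next(answers))
print(point)   # -> (1160, 690)
```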

Still unsolved — the safety ceiling

  • Prompt injection: OpenAI admitted December 2025 it “may never be fully solved.” Joint OpenAI/Anthropic/DeepMind study: >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253: one-click RCE via malicious webpage visit. Brave found indirect prompt injection in Perplexity Comet within weeks of launch.
  • Benchmark integrity: Berkeley RDI study showed GAIA, OSWorld (pre-Verified), SWE-Bench, Terminal-Bench 1.0 and others could be gamed to ~98% without actually solving tasks. GAIA validation answers are public on HuggingFace. Trust Pro / Verified variants with held-out test sets.
  • Economics: 10,000 Stagehand extractions/day = $50–$200/day in LLM fees vs zero for deterministic Playwright. Devin ACU math: 1 hour = ~$9 on Core. Per-action costs compound viciously at scale (worked through in the sketch after this list).
  • Speed: 2–5 seconds per action (vision call + reasoning + execution). Stagehand v3 got 44% faster but inference latency is the physical lower bound.
  • Captcha + TOS: Cloudflare, Arkose, hCaptcha now score AI-browser traffic patterns. LinkedIn, airline booking, banks actively block. Legal status of automated access remains unresolved.
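To make the economics bullet concrete, here is the arithmetic. The per-extraction token count and blended token rate are assumptions; the $2.25/ACU figure comes from the Devin row above:

```python
# Back-of-envelope agent economics.
extractions_per_day = 10_000
tokens_per_extraction = 6_000      # assumed: screenshot + DOM context + output
price_per_m_tokens = 3.00          # assumed blended $/M tokens

llm_cost = extractions_per_day * tokens_per_extraction / 1e6 * price_per_m_tokens
print(f"LLM-driven extraction: ${llm_cost:,.0f}/day vs ~$0 for a Playwright script")

acu_per_hour = 4                   # implied by 1 h = ~$9 at $2.25/ACU
print(f"Devin Core: {acu_per_hour} ACU/h x $2.25/ACU = ${acu_per_hour * 2.25:.2f}/hour")
```

At those assumptions the extraction bill lands at $180/day, inside the $50–$200 range quoted above; the gap versus a deterministic script is the whole cost argument.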

Frequently asked

Q1. What is a Computer Use agent?

A Computer Use agent is an AI system that can operate a computer the way a human does — reading screenshots, moving a mouse, typing on a keyboard, running shell commands, or driving a browser. In 2026 the category splits into four types: screen-level OS control (Claude Computer Use, Holo3, Kimi K2.6), browser-only agents (ChatGPT Atlas, Perplexity Comet, Surfer 2), sandboxed VMs / containers (OpenHands, E2B, Browserbase), and coding-focused agents (Claude Code, Devin, Cursor Agent, Codex).

Q2. What is the current OSWorld SOTA in 2026?

As of April 2026, the OSWorld-Verified leaderboard is led by Holo3-35B-A3B from H Company at 80.4%. Holo3 was the first model to cleanly beat the 72.4% human-expert baseline on the verified split. Kimi K2.6 (Moonshot AI) is second at 73.1% as a general-purpose model, and Claude Sonnet 4.6 is third at 72.1% — effectively tied with the human baseline.

Q3. Is OSWorld still the right benchmark, or is it outdated?

OSWorld was refreshed in July 2025 as OSWorld-Verified, which fixed 300+ task bugs, moved infrastructure to AWS for 50× parallelization, and cleaned evaluation robustness. The original April 2024 paper version is no longer considered authoritative. OSWorld-Verified is now part of the 2026 canonical triad (with BrowseComp and Terminal-Bench 2.0) that weighted agentic leaderboards use. That said, it's only one dimension — enterprise workflows need TheAgentCompany or WorkArena++, mobile needs AndroidWorld, and economic-impact testing needs GDPval.

Q4. How is OSWorld-Verified different from the original OSWorld?

OSWorld was released April 2024 by XLANG Lab (University of Hong Kong). OSWorld-Verified shipped July 2025 and systematically addressed 300+ community-reported issues: broken web tasks whose target sites changed, ambiguous instructions, and evaluation edge cases. It also migrated from VMware/Docker to AWS for managed parallelization and cut full-suite evaluation time to under an hour. Both cover the same 369 tasks (361 if the 8 Google Drive tasks that need manual setup are excluded), but scores on the two versions are not directly comparable.

Q5. What other benchmarks exist besides OSWorld?

The 2026 landscape is much broader than OSWorld alone. The core triad pairs OSWorld-Verified with BrowseComp (OpenAI, browsing depth) and Terminal-Bench 2.0 (CLI tasks). Enterprise workflow: TheAgentCompany (CMU) and WorkArena++ (ServiceNow). Browser-only: WebVoyager, Online-Mind2Web, Mind2Web 2, REAL (Browserbase). Mobile: AndroidWorld. Coding: SWE-Bench Pro, SWE-Bench Verified. Economic impact: GDPval. GUI grounding: ScreenSpot, ScreenSpot-Pro. General reasoning: GAIA.

Q6. Can any agent beat a human on these benchmarks?

On OSWorld-Verified: yes, two models now cleanly beat the 72.4% human baseline (Holo3-35B-A3B at 80.4%, Kimi K2.6 at 73.1%). On WebVoyager: yes, Surfer 2 at 97.1% pass@1 is above expected human accuracy. On BrowseComp: Claude Mythos Preview scores 86.9%, while humans with internet access score ~80%. On SWE-Bench Verified: top models pass 87%+ of real GitHub issues. On AndroidWorld: models still trail the ~80% human baseline; the current SOTA is UI-TARS-2 at 75.8%. On GDPval: agents lose blind expert pairwise comparisons the majority of the time.

Q7. What's the safety picture in 2026?

Prompt injection remains unsolved. OpenAI stated publicly in December 2025 that it 'may never be fully solved.' A joint OpenAI/Anthropic/DeepMind red-team study showed >90% bypass rate on every published defense under adaptive attack. CVE-2026-25253 demonstrated one-click RCE via a malicious webpage in a browser agent. Brave's security team found indirect prompt injection in Perplexity Comet within weeks of launch. Benchmark integrity is also fragile: a 2026 Berkeley RDI study showed 8 major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection without actually solving tasks.

Q8. Which agents are open-source?

The strongest open-source options in 2026: OpenHands (All Hands AI, sandbox + coding), Browser Use (Python library driving Chromium), Magnetic-One (Microsoft Research multi-agent), UI-TARS-2 (ByteDance, 53.1% OSWorld-Verified), Kimi K2.5/K2.6 (Moonshot, 63.3% / 73.1% OSWorld-Verified), Holo3 predecessor models (H Company research releases), GUI-Owl-1.5 32B (Alibaba, 55.4% OSWorld-Verified), Playwright MCP (Microsoft), Chrome DevTools MCP (Google). For the full list, look for Free (OSS) in the pricing column of the tables above.

Editor's take — April 23, 2026

2026 is the year computer use stopped being a demo and started being a line item. Winners today: H Company on OSWorld-Verified (Holo3-35B-A3B, 80.4%), Anthropic on Terminal-Bench 2.0, OpenAI on browsing (BrowseComp + Mind2Web 2), H Company's Surfer 2 on WebVoyager (97.1%), OpenHands for open-source coding, Claude Code for terminal autonomy, Microsoft Copilot Studio for enterprise distribution. Yet to ship: Meta, Apple, Amazon (no first-party CUA product yet). The big pattern: the harness — scaffold + sandbox + verifier + recovery — matters more than the model. Independent tests show Cursor's scaffold adds 16pp over the raw model. The open-source stack has caught and in some cases surpassed proprietary offerings: Kimi K2.6 (73.1% OSWorld-V) is the first open-source model to beat the human baseline.

Sources: OSWorld-Verified, XLANG Lab, BrowseComp, TheAgentCompany, Steel.dev, GDPval, BenchLM, REAL. Last reviewed April 23, 2026.