gentic.news — AI News Intelligence Platform

OSWorld 2.0 Launches, Tests AI Agents on 1,500 Desktop Tasks

Epoch AI released OSWorld 2.0 with 1,500 desktop tasks, up from 369 in v1, testing AI agents on adversarial and cross-application workflows.

Source: news.google.com via epoch_ai_gradient_updates_gn (corroborated)
What is OSWorld 2.0 and how does it benchmark AI agents?

OSWorld 2.0 expands the AI agent benchmark to 1,500 real-world desktop tasks across macOS, Windows, and Ubuntu, up from 369 in v1, testing computer-use agents on file operations, web browsing, and multi-step workflows.

TL;DR

OSWorld 2.0 expands to 1,500 real desktop tasks · Google's Gemini 3.5 Flash scored 78.4 on v1 · Benchmark targets computer-use agent reliability

Epoch AI released OSWorld 2.0, expanding the agent benchmark to 1,500 real-world desktop tasks. The update targets the reliability ceiling of computer-use AI agents, adding adversarial tests and cross-application workflows.

Key facts

  • OSWorld 2.0: 1,500 tasks, about 4x as many as v1's 369
  • Tests across macOS, Windows, and Ubuntu
  • 30% of tasks require cross-application workflows
  • Adversarial tests inject typos and UI changes
  • Gemini 3.5 Flash scored 78.4 on v1 in June 2026

Epoch AI released OSWorld 2.0, expanding the benchmark to 1,500 real-world desktop tasks across macOS, Windows, and Ubuntu, up from 369 in v1, according to the source. The new version targets the reliability ceiling of computer-use agents: it adds adversarial robustness tests that inject typos and UI changes, and 30% of its tasks require multi-step reasoning across multiple applications.

Why OSWorld matters

OSWorld has become the de facto benchmark for computer-use agents — models that control desktop interfaces by interpreting screenshots and executing mouse/keyboard actions. Google's Gemini 3.5 Flash scored 78.4 on OSWorld v1 in June 2026, matching GPT-5.5. The benchmark's difficulty stems from its grounding in real operating systems: agents must handle file dialogs, browser navigation, and application menus that change with OS updates.

What v2 changes

OSWorld 2.0 adds adversarial examples — deliberately misspelled filenames, altered button labels, and window resizing — to test whether agents rely on brittle visual patterns rather than semantic understanding. The new tasks span 12 application categories including spreadsheet manipulation, image editing, and terminal commands. Epoch AI did not disclose whether any model has completed a full evaluation on v2.
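
The perturbations described above can be illustrated with a small sketch. This is not Epoch AI's implementation, which the source does not describe; the function names, the adjacent-character swap, and the label table are all hypothetical, chosen only to show what a "misspelled filename" or "altered button label" might look like in practice.

```python
import random

# Hypothetical label substitutions, for illustration only.
LABEL_VARIANTS = {"Save": "Store", "Open": "Load", "Cancel": "Dismiss"}


def misspell_filename(name: str, rng: random.Random) -> str:
    """Swap two adjacent characters in the stem, leaving the extension intact.

    If the two swapped characters happen to be identical, the result
    equals the input.
    """
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present: treat the whole name as the stem
        stem, ext = name, ""
    if len(stem) < 2:
        return name
    i = rng.randrange(len(stem) - 1)
    chars = list(stem)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars) + dot + ext


def relabel_button(label: str) -> str:
    """Return an altered label if one is defined, else the original label."""
    return LABEL_VARIANTS.get(label, label)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the perturbation is reproducible
    print(misspell_filename("quarterly_report.csv", rng))
    print(relabel_button("Save"))
```

An agent that matches on exact strings or pixel positions would be expected to stumble on inputs like these, while one that understands the intent of the task should still find the right file or button.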

The reliability problem

Current computer-use agents succeed on roughly 20-30% of OSWorld v1 tasks, per public leaderboards. The v2 expansion suggests the field needs fundamental improvements in agent planning and error recovery rather than incremental vision-model gains. The benchmark's multi-app workflows require an agent to, for example, download a CSV from Gmail, edit it in LibreOffice Calc, and email the result — a sequence that fails if any single step errors.
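
The observation that a workflow fails if any single step errors can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every step succeeds independently with the same probability; that assumption, the function name, and the per-step rates are illustrative and do not come from the benchmark.

```python
def workflow_success_rate(step_success: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_success ** num_steps


if __name__ == "__main__":
    # Three steps, as in the download / edit / email example.
    for p in (0.99, 0.95, 0.90):
        rate = workflow_success_rate(p, 3)
        print(f"per-step {p:.2f} -> 3-step workflow {rate:.3f}")
```

Under these assumptions a per-step success rate of 0.90 yields about 0.73 for a three-step workflow, which is why longer cross-application chains put so much pressure on per-step reliability and error recovery.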

What to watch

Watch for the first published leaderboard scores on OSWorld 2.0, expected within 60 days. If top models drop below a 15% success rate, down from roughly 25% on v1, that would signal that computer-use agents remain years from production reliability.



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

OSWorld 2.0 reveals a structural tension in agent benchmarking: as models improve on static tasks, the benchmark designers must make the environment harder. The adversarial additions suggest Epoch AI believes current agents exploit shortcuts, memorizing button positions rather than understanding intent. This mirrors the shift from standard NLP benchmarks to adversarial GLUE and SuperGLUE in 2019-2020.

The cross-application workflow emphasis is the more interesting signal. Single-application agents (e.g., code editors, email clients) are nearing usability, but the value proposition of computer-use agents is chaining multiple tools. OSWorld 2.0's 30% multi-app tasks force agents to maintain state across application boundaries, a problem that vision-language models alone cannot solve without robust planning layers.

Google's investment in OSWorld performance (Gemini 3.5 Flash at 78.4 on v1) suggests the company sees computer-use agents as a distribution channel for Google Cloud and Workspace products. If agents can reliably automate desktop workflows, Google's enterprise bundling strategy gains a moat against competitors like Microsoft Copilot. OSWorld 2.0 will test whether that bet is premature.