What makes OSWorld different from other agent benchmarks?

OSWorld tests agents on real operating systems with actual mouse and keyboard actions, not simulated environments, making it harder and more realistic.

Why are adversarial tests important for agent benchmarks?

They reveal whether agents rely on superficial visual cues rather than understanding the task, a key failure mode for production deployments.

![[3/8] Existing desktop benchmarks now risk looking close to ...](https://pbs.twimg.com/media/HLvyTOWWcAA-eOI.jpg)


AI Research · Score: 77

OSWorld 2.0 Launches, Tests AI Agents on 1,500 Desktop Tasks

Epoch AI released OSWorld 2.0 with 1,500 desktop tasks, up from 369 in v1, testing AI agents on adversarial and cross-application workflows.

Ala SMITH & AI Research Desk · 23h ago · 2 min read · AI-Generated

Source: news.google.com via epoch_ai_gradient_updates_gn · Corroborated

What is OSWorld 2.0 and how does it benchmark AI agents?

OSWorld 2.0 expands the AI agent benchmark to 1,500 real-world desktop tasks across macOS, Windows, and Ubuntu, up from 369 in v1, testing computer-use agents on file operations, web browsing, and multi-step workflows.

TL;DR

OSWorld 2.0 expands to 1,500 real desktop tasks · Google's Gemini 3.5 Flash scored 78.4 on v1 · Benchmark targets computer-use agent reliability

Epoch AI released OSWorld 2.0, expanding the agent benchmark to 1,500 real-world desktop tasks. The update targets the reliability ceiling of computer-use AI agents, adding adversarial tests and cross-application workflows.

Key facts

OSWorld 2.0: 1,500 tasks, roughly 4x v1's 369
Tests across macOS, Windows, and Ubuntu
30% of tasks require cross-application workflows
Adversarial tests inject typos and UI changes
Gemini 3.5 Flash scored 78.4 on v1 in June 2026

Epoch AI released OSWorld 2.0, expanding the benchmark to 1,500 real-world desktop tasks across macOS, Windows, and Ubuntu, up from 369 in v1, according to the source. The new version targets the reliability ceiling of computer-use agents: it adds adversarial robustness tests that inject typos and UI changes, and 30% of its tasks require multi-step reasoning across multiple applications.

Why OSWorld matters

OSWorld has become the de facto benchmark for computer-use agents — models that control desktop interfaces by interpreting screenshots and executing mouse/keyboard actions. Google's Gemini 3.5 Flash scored 78.4 on OSWorld v1 in June 2026, matching GPT-5.5. The benchmark's difficulty stems from its grounding in real operating systems: agents must handle file dialogs, browser navigation, and application menus that change with OS updates.
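The screenshot-in, action-out loop described above can be sketched as follows. This is a minimal illustration, not OSWorld's actual harness: `Action`, `DesktopEnv`, `FakeDesktop`, and `run_episode` are hypothetical names, and the fake desktop only records actions instead of driving a real operating system.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; real harnesses define their own schema."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class DesktopEnv(Protocol):
    """Assumed environment interface: observe pixels, apply one action."""
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


class FakeDesktop:
    """Stand-in environment that logs actions rather than touching an OS."""
    def __init__(self) -> None:
        self.log: list[Action] = []

    def screenshot(self) -> bytes:
        # A real environment would return encoded image bytes.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_episode(
    env: DesktopEnv,
    policy: Callable[[bytes], Action],
    max_steps: int = 15,
) -> list[Action]:
    """Observe, decide, act, until the policy says "done" or the budget ends."""
    executed: list[Action] = []
    for _ in range(max_steps):
        action = policy(env.screenshot())
        if action.kind == "done":
            break
        env.execute(action)
        executed.append(action)
    return executed
```

In this sketch the policy is any callable from a screenshot to an `Action`; a model-backed policy would replace the scripted one used for testing, and the step budget stands in for whatever limit an evaluator imposes.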

What v2 changes

OSWorld 2.0 adds adversarial examples — deliberately misspelled filenames, altered button labels, and window resizing — to test whether agents rely on brittle visual patterns rather than semantic understanding. The new tasks span 12 application categories including spreadsheet manipulation, image editing, and terminal commands. Epoch AI did not disclose whether any model has completed a full evaluation on v2.
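As a rough illustration of the three perturbation types named above (misspelled filenames, altered button labels, window resizing), the sketch below perturbs a toy task description. Everything here is an assumption made for illustration: `TaskSpec`, `SYNONYMS`, `inject_typo`, and `perturb` are hypothetical, and the source does not describe how Epoch AI generates its adversarial tasks.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task description, not OSWorld's real task format."""
    filename: str
    button_label: str
    window_size: tuple[int, int]


# Assumed label substitutions; a real suite would choose its own.
SYNONYMS = {"Save": "Store", "Open": "Load", "OK": "Confirm"}


def inject_typo(name: str, rng: random.Random) -> str:
    """Swap two adjacent characters in the filename stem, keeping the extension."""
    stem, dot, ext = name.rpartition(".")
    if not dot:                 # no extension: the whole name is the stem
        stem, ext = name, ""
    if len(stem) < 2:
        return name             # nothing to swap
    i = rng.randrange(len(stem) - 1)
    chars = list(stem)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars) + (dot + ext if dot else "")


def perturb(task: TaskSpec, seed: int) -> TaskSpec:
    """Return a copy of the task with all three perturbations applied."""
    rng = random.Random(seed)   # seeded so a variant can be reproduced
    width, height = task.window_size
    scale = rng.choice([0.5, 0.75, 1.25])
    return replace(
        task,
        filename=inject_typo(task.filename, rng),
        button_label=SYNONYMS.get(task.button_label, task.button_label),
        window_size=(int(width * scale), int(height * scale)),
    )
```

An agent that matches on exact pixel positions or literal strings would be expected to fail on the perturbed variant, while one that understands the intent of the task should still complete it.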

The reliability problem

Current computer-use agents succeed on roughly 20-30% of OSWorld v1 tasks, per public leaderboards. The v2 expansion suggests the field needs fundamental improvements in agent planning and error recovery rather than incremental vision-model gains. The benchmark's multi-app workflows require an agent to, for example, download a CSV from Gmail, edit it in LibreOffice Calc, and email the result — a sequence that fails if any single step errors.
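The claim that a workflow fails if any single step errors has a simple arithmetic consequence, sketched below. The model assumes each step succeeds independently with the same probability and that the agent never recovers from an error; both are simplifying assumptions for illustration, not figures from the benchmark, and the function names are hypothetical.

```python
def chain_success(step_success: float, steps: int) -> float:
    """End-to-end success when every one of `steps` must succeed."""
    return step_success ** steps


def required_step_success(target: float, steps: int) -> float:
    """Per-step success needed to reach `target` end-to-end success."""
    return target ** (1 / steps)


if __name__ == "__main__":
    # Three-step workflow (download, edit, email) at 90% per-step success.
    print(round(chain_success(0.9, 3), 3))
    # Per-step reliability needed for 90% success over ten steps.
    print(round(required_step_success(0.9, 10), 4))
```

Under these assumptions, per-step reliability has to rise sharply as workflows get longer, which is one way to read the emphasis on planning and error recovery.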

What to watch

Watch for the first published leaderboard scores on OSWorld 2.0, expected within 60 days. If top models drop below 15% success rate from v1's ~25%, it signals that computer-use agents remain years from production reliability.

Source: gentic.news · 23h ago · author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

OSWorld 2.0 reveals a structural tension in agent benchmarking: as models improve on static tasks, benchmark designers must make the environment harder. The adversarial additions suggest Epoch AI believes current agents exploit shortcuts, such as memorizing button positions rather than understanding intent. This mirrors NLP's move from GLUE to the harder SuperGLUE in 2019 and, later, to adversarial variants such as Adversarial GLUE.

The cross-application workflow emphasis is the more interesting signal. Single-application agents (e.g., code editors, email clients) are nearing usability, but the value proposition of computer-use agents is chaining multiple tools. With 30% of its tasks spanning multiple applications, OSWorld 2.0 forces agents to maintain state across application boundaries, a problem that vision-language models alone cannot solve without robust planning layers.

Google's investment in OSWorld performance (Gemini 3.5 Flash at 78.4 on v1) suggests the company sees computer-use agents as a distribution channel for Google Cloud and Workspace products. If agents can reliably automate desktop workflows, Google's enterprise bundling strategy gains a moat against competitors like Microsoft Copilot. OSWorld 2.0 will test whether that bet is premature.
