Epoch AI released OSWorld 2.0, expanding the agent benchmark to 1,500 real-world desktop tasks. The update targets the reliability ceiling of computer-use AI agents, adding adversarial tests and cross-application workflows.
Key facts
- OSWorld 2.0: 1,500 tasks, 4x more than v1's 369
- Tests across macOS, Windows, and Ubuntu
- 30% of tasks require cross-application workflows
- Adversarial tests inject typos and UI changes
- Gemini 3.5 Flash scored 78.4 on v1 in June 2026
Epoch AI released OSWorld 2.0, expanding the benchmark to 1,500 real-world desktop tasks across macOS, Windows, and Ubuntu — up from 369 in v1, according to the source. The new version targets the reliability ceiling of computer-use agents, adding adversarial robustness tests that inject typos and UI changes, plus 30% more tasks requiring multi-step reasoning across multiple applications.
Why OSWorld matters
![[3/8] Existing desktop benchmarks now risk looking close to ...](https://pbs.twimg.com/media/HLvyTOWWcAA-eOI.jpg)
OSWorld has become the de facto benchmark for computer-use agents — models that control desktop interfaces by interpreting screenshots and executing mouse/keyboard actions. Google's Gemini 3.5 Flash scored 78.4 on OSWorld v1 in June 2026, matching GPT-5.5. The benchmark's difficulty stems from its grounding in real operating systems: agents must handle file dialogs, browser navigation, and application menus that change with OS updates.
What v2 changes
OSWorld 2.0 adds adversarial examples — deliberately misspelled filenames, altered button labels, and window resizing — to test whether agents rely on brittle visual patterns rather than semantic understanding. The new tasks span 12 application categories including spreadsheet manipulation, image editing, and terminal commands. Epoch AI did not disclose whether any model has completed a full evaluation on v2.
The reliability problem

Current computer-use agents succeed on roughly 20-30% of OSWorld v1 tasks, per public leaderboards. The v2 expansion suggests the field needs fundamental improvements in agent planning and error recovery rather than incremental vision-model gains. The benchmark's multi-app workflows require an agent to, for example, download a CSV from Gmail, edit it in LibreOffice Calc, and email the result — a sequence that fails if any single step errors.
What to watch
Watch for the first published leaderboard scores on OSWorld 2.0, expected within 60 days. If top models drop below 15% success rate from v1's ~25%, it signals that computer-use agents remain years from production reliability.
Source: news.google.com








