OSWorld (introduced in 2024 by researchers from the University of Hong Kong, CMU, Salesforce Research, and others) is a benchmark for evaluating multimodal AI agents on open-ended computer tasks. It addresses a critical gap in prior evaluations: while earlier benchmarks like MiniWoB++ or WebArena focused on constrained web-based tasks, OSWorld requires agents to interact with real, unmodified operating systems (Ubuntu, Windows, macOS) via keyboard and mouse actions, processing raw pixel screenshots and executing multi-step workflows.
How it works: OSWorld provides 369 tasks spanning domains such as office productivity, file management, web browsing, and multimedia. Each task specifies a natural-language goal (e.g., "Save the LibreOffice document as a PDF in the Downloads folder") and an initial environment state set up inside a virtual machine. The agent must output a sequence of low-level actions (mouse clicks, keystrokes, scrolls) to complete the task. Success is measured by whether the final system state satisfies the goal: for example, checking that a file exists with the correct name and extension. Each task ships with a deterministic, execution-based evaluation script that validates the end state rather than the path taken, so different strategies can all earn credit (a minimal sketch of such a checker follows).
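To make the end-state evaluation concrete, here is a minimal Python sketch of a state-based checker for the PDF example above. The task dictionary layout, the `check_pdf_saved` function, and the `report.pdf` filename are illustrative assumptions, not OSWorld's actual task schema or evaluator API.

```python
import os

# Hypothetical task specification (illustrative; not OSWorld's real schema).
task = {
    "instruction": "Save the LibreOffice document as a PDF in the Downloads folder",
    "expected_path": os.path.expanduser("~/Downloads/report.pdf"),
}

def check_pdf_saved(expected_path: str) -> float:
    """Score the final system state, ignoring how the agent got there.

    Returns 1.0 on success and 0.0 on failure, mirroring the
    pass/fail end-state evaluation described above.
    """
    # The checker inspects the filesystem after the episode ends:
    # the target file must exist and carry the .pdf extension.
    if os.path.isfile(expected_path) and expected_path.lower().endswith(".pdf"):
        return 1.0
    return 0.0

print(f"task score: {check_pdf_saved(task['expected_path'])}")
```

Because only the end state is inspected, an agent that reaches the goal through the File menu, a keyboard shortcut, or the command line is scored identically.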
Why it matters: OSWorld is currently the most realistic and challenging benchmark for generalist computer agents. Prior work (e.g., GPT-4V with Set-of-Mark (SoM) prompting, CogAgent, UI-Adapter) showed strong performance on simpler benchmarks like ScreenSpot or Mind2Web, but OSWorld reveals that even state-of-the-art models struggle: as of late 2025, the best-performing agent (a fine-tuned version of GPT-4o with a custom action space) achieves only ~38% task success, while human performance is ~72%. This gap highlights fundamental limitations in current models' ability to understand screen layouts, plan multi-step procedures, and recover from errors.
When it's used vs. alternatives: OSWorld is used when evaluating general-purpose computer control, as opposed to:
- WebArena/VisualWebArena (web-only tasks)
- MiniWoB++ (simplified web UI)
- AndroidEnv (Android-only)
- Meta-World or RLBench (robotics, not desktop)
Common pitfalls: (1) Agents often fail on tasks requiring precise timing or drag-and-drop. (2) Models struggle with unseen software versions or non-English UI elements. (3) Many agents rely on brittle OCR or DOM parsing, which fails on games and custom GUI frameworks. (4) Evaluation is costly: each run requires restoring the VM to a clean snapshot, which limits iteration speed (a harness sketch follows below).
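To illustrate pitfall (4), below is a minimal harness sketch that reverts to a clean snapshot before every task. The VM name `osworld-ubuntu`, the snapshot name `clean`, and the `run_agent_on_task` stub are assumptions made for illustration; OSWorld's own VM-based harness differs in its details.

```python
import subprocess

VM_NAME = "osworld-ubuntu"  # assumed VM name, not from the OSWorld codebase
SNAPSHOT = "clean"          # assumed snapshot capturing the initial state

def reset_vm() -> None:
    """Revert the VM to its pristine snapshot: the expensive per-task step."""
    # Power off (ignoring failure if already stopped), restore, restart.
    subprocess.run(["VBoxManage", "controlvm", VM_NAME, "poweroff"], check=False)
    subprocess.run(["VBoxManage", "snapshot", VM_NAME, "restore", SNAPSHOT], check=True)
    subprocess.run(["VBoxManage", "startvm", VM_NAME, "--type", "headless"], check=True)

def run_agent_on_task(task_id: str) -> float:
    """Stub: drive the agent on one task and score the end state."""
    return 0.0  # placeholder score

def evaluate(task_ids: list[str]) -> float:
    scores = []
    for task_id in task_ids:
        reset_vm()  # the full reset per task dominates wall-clock time
        scores.append(run_agent_on_task(task_id))
    return sum(scores) / len(scores)
```

The snapshot restore, not the agent's inference, is typically the throughput bottleneck, which is why iteration on the full benchmark is slow.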
Current state of the art (2026): The leading approaches use vision-language models fine-tuned on human demonstration trajectories (e.g., OS-1 by Microsoft, based on Phi-3.5-vision) combined with a learned action head that predicts mouse coordinates and key presses. Reinforcement learning from human feedback (RLHF) on partial task completions has shown modest gains. The community is moving toward "agentic loops" in which the model can call external tools (e.g., Python scripts, shell commands) to sidestep screen-understanding bottlenecks; a sketch of such a loop follows. No model has yet exceeded 45% task success on the full OSWorld benchmark.
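As a rough sketch of such an agentic loop, the Python below lets the model choose at each step between a low-level GUI action and a shell command. The `query_model`, `take_screenshot`, and `execute_gui_action` helpers and the action dictionary format are hypothetical placeholders, not any published agent's interface.

```python
import subprocess

def take_screenshot() -> bytes:
    """Stub: a real agent would capture raw screen pixels here."""
    return b""

def execute_gui_action(action: dict) -> None:
    """Stub: a real agent would dispatch the click/keystroke to the OS."""
    pass

def query_model(screenshot: bytes, goal: str, history: list[dict]) -> dict:
    """Stub for a vision-language model call (hypothetical interface).

    A real policy would return an action dictionary such as
    {"type": "click", "x": 412, "y": 97} or
    {"type": "shell", "cmd": "ls ~/Downloads"}.
    This stub simply ends the episode.
    """
    return {"type": "done"}

def agent_loop(goal: str, max_steps: int = 30) -> None:
    history: list[dict] = []
    for _ in range(max_steps):
        action = query_model(take_screenshot(), goal, history)
        if action["type"] == "done":
            break
        if action["type"] == "shell":
            # Tool-call path: query the OS directly instead of parsing pixels.
            result = subprocess.run(
                action["cmd"], shell=True, capture_output=True, text=True
            )
            history.append({"action": action, "observation": result.stdout})
        else:
            # GUI path: a low-level mouse/keyboard action, as in the benchmark.
            execute_gui_action(action)
            history.append({"action": action})
```

Routing some steps through the shell lets the agent read ground-truth state (file listings, process status) rather than inferring it from screenshots, which is precisely the screen-understanding bottleneck this trend targets.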