CMU researchers built Gym-Anything, a system that turns any software into an AI agent training environment. The resulting CUA-World benchmark spans 10,000+ tasks across 200 applications covering all 22 major occupation groups, according to the arXiv preprint.
Key facts
- Gym-Anything automates creation of agent training environments.
- CUA-World includes 10,000+ tasks across 200 applications.
- Covers all 22 major occupation groups.
- Two-agent loop: one creates, one audits.
- Strong models fail most long, real-world tasks.
Most agent benchmarks test curated, short web or desktop tasks that don't reflect messy, long-running workplace workflows. Building realistic environments by hand has been the bottleneck, and per the arXiv paper, Gym-Anything attacks it by making environment creation itself an agent job.
How the Two-Agent Loop Works
One agent writes scripts, installs software, loads real data, opens the app, and collects proof that it works. A second agent audits that proof with screenshots, logs, files, and checklists, then sends fixes back when the setup is weak. Using this loop, the authors built CUA-World, with 10,000+ tasks across 200 applications covering all 22 major occupation groups.
On CUA-World, even strong models solved only a small share of the hardest long tasks. The paper does not disclose exact pass rates for individual models, but states that real computer-use work remains "far from solved." This echoes early results on agent benchmarks such as SWE-bench, where the best models initially resolved only a small fraction of realistic software engineering tasks.
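The create-and-audit loop described in this section can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names Evidence, AuditReport, build_environment, stub_creator, and stub_auditor, the checklist items, and the retry limit are all assumptions made for the example, and the two "agents" are plain functions standing in for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Proof the creator collects (field names are hypothetical)."""
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    files: List[str] = field(default_factory=list)


@dataclass
class AuditReport:
    """Auditor verdict plus the fixes it sends back to the creator."""
    passed: bool
    fixes: List[str] = field(default_factory=list)


def build_environment(
    create: Callable[[List[str]], Evidence],
    audit: Callable[[Evidence], AuditReport],
    max_rounds: int = 3,  # assumed retry limit, not taken from the paper
) -> Tuple[bool, int]:
    """Alternate creator and auditor until the audit passes or rounds run out."""
    fixes: List[str] = []
    for round_no in range(1, max_rounds + 1):
        evidence = create(fixes)   # creator sets up the app and gathers proof
        report = audit(evidence)   # auditor checks the proof
        if report.passed:
            return True, round_no
        fixes = report.fixes       # weak setup: send fixes back and retry
    return False, max_rounds


def stub_creator(fixes: List[str]) -> Evidence:
    """Stand-in creator: forgets the screenshot until asked for it."""
    evidence = Evidence(logs=["install ok"], files=["sample.db"])
    if "add screenshot of open app" in fixes:
        evidence.screenshots.append("app_open.png")
    return evidence


def stub_auditor(evidence: Evidence) -> AuditReport:
    """Stand-in auditor: a fixed checklist over the collected proof."""
    checklist = {
        "add screenshot of open app": bool(evidence.screenshots),
        "attach install log": bool(evidence.logs),
        "load real data file": bool(evidence.files),
    }
    missing = [fix for fix, ok in checklist.items() if not ok]
    return AuditReport(passed=not missing, fixes=missing)


if __name__ == "__main__":
    print(build_environment(stub_creator, stub_auditor))
```

With these stubs the first round fails the audit because no screenshot was collected, the auditor returns that item as a fix, and the second round passes. Capping the number of rounds is a design choice for the sketch so that a setup the creator cannot repair is reported as failed rather than retried forever.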
The Bad News: Real Work Still Bites
Once tasks look like real work (long, multi-step, with unpredictable state), today's agents fail most of the time. Gym-Anything's key contribution is methodological: it removes the manual labor of building training environments, which has been the gating factor for scaling agent evaluation. The novel piece is the two-agent verification loop, in which one agent generates the setup and another audits it, reducing the risk of stale or incorrect environment setups.
The research also implicitly critiques the current benchmark ecosystem. Most evaluations use small, hand-crafted tasks that don't capture the long-tail complexity of enterprise software. By covering all 22 major occupation groups (from healthcare to construction), CUA-World aims to be a more representative test of generalist agent capability.
Limitations
Gym-Anything's environment quality depends on the auditing agent's reliability. If the auditor misses bugs, the training signal degrades. The paper does not report the auditor's false-positive or false-negative rates, a gap that matters for production use. Additionally, the 200 applications are a sample — covering all occupation groups doesn't mean covering all software within each group.
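To make the missing numbers concrete, the sketch below shows how an auditor's false-positive and false-negative rates could be computed if a set of environments were also checked by hand. The function name auditor_error_rates and the sample verdicts are hypothetical; none of the figures come from the paper. Here a "positive" means the auditor approved the environment, so a false positive is a broken environment that was approved.

```python
from typing import Iterable, Tuple


def auditor_error_rates(verdicts: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Each item is (auditor_approved, actually_ok), where actually_ok is a
    hand-checked ground-truth label.
    """
    tp = fp = tn = fn = 0
    for approved, actually_ok in verdicts:
        if approved and actually_ok:
            tp += 1      # working environment, correctly approved
        elif approved and not actually_ok:
            fp += 1      # broken environment slipped through
        elif not approved and actually_ok:
            fn += 1      # working environment wrongly rejected
        else:
            tn += 1      # broken environment, correctly rejected
    # Guard against empty classes to avoid division by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Made-up sample: 8 working environments (7 approved, 1 rejected)
# and 2 broken ones (1 approved, 1 rejected).
sample = (
    [(True, True)] * 7
    + [(False, True)]
    + [(True, False)]
    + [(False, False)]
)

if __name__ == "__main__":
    print(auditor_error_rates(sample))
```

In this invented sample the false-positive rate is 1 of 2 broken environments and the false-negative rate is 1 of 8 working ones. The false-positive rate is the one that affects training signal most directly, since approved-but-broken environments end up in the benchmark.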
What to watch
Watch for follow-up evaluations that disclose per-model pass rates on CUA-World, and for whether OpenAI or Anthropic adopts Gym-Anything for internal red-teaming. If the two-agent loop is integrated into commercial agent frameworks, it could standardize how agent training environments are built.









