Screen-level OS control · #7 of 14 in category
GPT-5.5
OpenAI · Launched May 2026
OpenAI's 2026 frontier model; OSWorld-Verified 78.7% and the Terminal-Bench 2.1 leader via Codex CLI (83.4%).
- Benchmarks scored: 3
- Peak score: 82.6
- Article mentions: 0
- Open source: No
Benchmark performance
SWE-Bench Verified
OpenAI-verified 500-issue subset of SWE-Bench. Approaching saturation in 2026: most frontier models score above 80%.
82.6
Gap to SOTA: -12.4pp (held by Claude Fable 5)
OSWorld-Verified
369 real tasks on a live Ubuntu desktop VM: file I/O, spreadsheets, creative apps, settings. The July 2025 Verified rebuild moved to AWS (50x parallel) and fixed 300+ task bugs. The flagship computer-use benchmark.
78.7
SWE-Bench Pro
Harder, contamination-resistant successor to SWE-Bench Verified: real GitHub issues with held-out tests. Where coding headroom remains.
58.6
Gap to SOTA: -10.6pp (held by Claude Opus 4.8)
Other screen-level OS control agents
The 14 agents in this category, ranked by peak benchmark.
| Agent | Maker | Launch | Peak | Pricing |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 2026-02 | 1470.0 | $3 / $15 per M tokens |
| Claude Opus 4.8 | Anthropic | 2026-05 | 88.6 | $5 / $25 per M tokens |
| UI-TARS-2 (OSS) | ByteDance | 2025-09 | 88.2 | Open weights |
| Claude Opus 4.7 | Anthropic | 2026-03 | 87.6 | $5 / $25 per M tokens |
| Claude Mythos Preview | Anthropic | 2026-04 | 86.9 | Research preview |
| Gemini 3.1 Pro | Google DeepMind | 2026-02 | 85.9 | Google API |
| Holo3-35B-A3B | H Company | 2026-04 | 82.6 | H Company |
| GUI-Owl-1.5 (OSS) | Alibaba | 2026-02 | 80.3 | Open weights |
| Holo3-122B-A10B | H Company | 2026-04 | 78.8 | H Company |
| GPT-5.4 | OpenAI | 2026-03 | 75.0 | OpenAI API |
Quick facts
- Type: Screen-level OS control
- Maker: OpenAI
- Launch: 2026-05-01
- Open source: No
- Pricing: OpenAI API
- Benchmarks scored: 3
- Article mentions: 0
- Rank in category: #7 of 14