Claude Opus 4.8 completes 89% of WorkBench tasks and causes unintended harm on just 2.5% of them, per the updated benchmark released June 10, 2026. That is a large improvement over GPT-4, which completed 43% of tasks with a 26% harm rate in March 2024.
Key facts
- Claude Opus 4.8: 89% task completion, 2.5% harm rate.
- GPT-4 (2024): 43% task completion, 26% harm rate.
- Open-weight models match 2024 frontier performance at lower cost.
- Capability and safety correlate positively on WorkBench.
- Residual errors cause irreversible harm in rare cases.
The WorkBench Revisited paper by Olly Styles evaluates frontier and open-weight agents on 690 workplace tasks, measuring both task completion and harmful side effects. GPT-4, the best agent in 2024, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. By June 2026, Claude Opus 4.8 completes 89% and takes an unintended harmful action on 2.5%.
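To put those percentages in concrete terms, the short sketch below converts the reported rates into approximate task counts and fold changes. It assumes every rate is measured over the same 690 tasks, which the summary above implies but does not state outright. The variable and function names are illustrative and do not come from the paper.

```python
# Rates as reported in the article; names here are illustrative only.
TOTAL_TASKS = 690  # assumption: all rates are measured over the full task set

reported = {
    "GPT-4 (2024)": {"completion": 0.43, "harm": 0.26},
    "Claude Opus 4.8 (2026)": {"completion": 0.89, "harm": 0.025},
}


def approx_task_count(rate: float, total: int = TOTAL_TASKS) -> int:
    """Convert a reported rate into an approximate number of tasks."""
    return round(rate * total)


old = reported["GPT-4 (2024)"]
new = reported["Claude Opus 4.8 (2026)"]

# How many times higher completion is, and how many times lower harm is.
completion_ratio = new["completion"] / old["completion"]
harm_reduction = old["harm"] / new["harm"]

for name, rates in reported.items():
    print(
        f"{name}: ~{approx_task_count(rates['completion'])} tasks completed, "
        f"~{approx_task_count(rates['harm'])} with unintended harm"
    )
print(f"Completion ratio: {completion_ratio:.2f}x")
print(f"Harm rate reduction: {harm_reduction:.1f}x")
```

Under that assumption, the harm rate falls by roughly a factor of ten (about 179 tasks down to about 17), while completion roughly doubles.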
Capability and Safety Align, Not Trade Off

According to the paper, capability and safety go together on WorkBench rather than trading off: models with higher task completion rates consistently show lower harmful side-effect rates, so the models that finish the most tasks also do the least unintended damage. This contradicts the common assumption that more capable agents necessarily introduce greater risk.
Open-Weight Models Close the Cost Gap
Open-weight models have sharply lowered the cost of reaching a performance level that was previously available only from proprietary models. Frontier costs have stayed relatively stable, while open-weight models now achieve comparable task completion at a fraction of the per-task cost. The paper plots cost per task against completion, showing an efficient frontier where open-weight models cluster at lower cost points.
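For readers unfamiliar with the term, an efficient frontier on a cost-versus-completion plot is the set of agents that no other agent beats on both cost and completion at once. The sketch below shows one common way to compute it. The agent names and numbers are made-up placeholders, not figures from the paper, and this is not necessarily how the author builds the plot.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_per_task: float  # lower is better
    completion: float     # higher is better


def efficient_frontier(results: list[AgentResult]) -> list[AgentResult]:
    """Return the agents not dominated on both cost and completion."""
    # Sort cheapest first; on equal cost, put the higher completion first.
    ordered = sorted(results, key=lambda r: (r.cost_per_task, -r.completion))
    frontier = []
    best_completion = float("-inf")
    for result in ordered:
        # A costlier agent earns a place only by completing more tasks
        # than every cheaper agent seen so far.
        if result.completion > best_completion:
            frontier.append(result)
            best_completion = result.completion
    return frontier


# Hypothetical placeholder data, NOT taken from the WorkBench paper.
placeholder_results = [
    AgentResult("open-small", 0.02, 0.40),
    AgentResult("open-large", 0.05, 0.60),
    AgentResult("proprietary-old", 0.30, 0.45),
    AgentResult("proprietary-new", 0.40, 0.85),
]

frontier_names = [r.name for r in efficient_frontier(placeholder_results)]
print(frontier_names)
```

In this toy data, "proprietary-old" drops off the frontier because "open-large" is both cheaper and completes more tasks, which is the kind of displacement the cost comparison describes.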

Residual Errors Persist

While several classes of error have been eliminated entirely, frontier models still make some basic mistakes that occasionally cause irreversible harm, such as sending an email to the wrong person. The paper notes that these errors are rare but consequential, suggesting that agent safety remains an unsolved problem in the tail of the distribution.

What to watch
Watch for the next WorkBench update, when GPT-5.5 or Gemini Ultra 3 scores are published. The key questions are whether the harm rate can drop below 1% while task completion stays above 90%, and whether open-weight models can break the 80% completion barrier.
Source: arxiv.org








