WorkBench Revisited: Claude Opus 4.8 Hits 89% Task Completion

Claude Opus 4.8 completes 89% of WorkBench tasks with a 2.5% harm rate, versus GPT-4's 43% completion and 26% harm rate in 2024, suggesting that capability and safety align.

Source: arxiv.org (via arxiv_ai, reddit_claude; corroborated)
How does Claude Opus 4.8 perform on the WorkBench benchmark compared to GPT-4 in 2024?

Claude Opus 4.8 completes 89% of WorkBench tasks with a 2.5% harmful side-effect rate, compared with GPT-4's 43% completion and 26% harmful side-effect rate in 2024, per the updated benchmark released June 10, 2026.

TL;DR

Claude Opus 4.8 completes 89% of WorkBench tasks. · Unintended harmful actions dropped from 26% to 2.5%. · Open-weight models now match 2024 frontier performance at lower cost.

Claude Opus 4.8 completes 89% of WorkBench tasks and causes unintended harm on just 2.5% of them, per the updated benchmark released June 10, 2026. That is a dramatic improvement from GPT-4's 43% completion and 26% harm rate in March 2024.

Key facts

  • Claude Opus 4.8: 89% task completion, 2.5% harm rate.
  • GPT-4 (2024): 43% task completion, 26% harm rate.
  • Open-weight models match 2024 frontier performance at lower cost.
  • Capability and safety correlate positively on WorkBench.
  • Residual errors cause irreversible harm in rare cases.

The WorkBench Revisited paper by Olly Styles evaluates frontier and open-weight agents across 690 workplace tasks, measuring both task completion and harmful side effects. The best agent in 2024, GPT-4, completed 43% of tasks and took an unintended harmful action—such as emailing the wrong person—on 26% of them. By June 2026, Claude Opus 4.8 completes 89% and takes an unintended harmful action on 2.5%.
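To make the two headline metrics concrete, here is a minimal sketch of how a completion rate and a harm rate could be tallied from per-task outcomes. The `TaskResult` record, the `summarize` and `approx_count` helpers, and the toy results are assumptions made for illustration, not the paper's evaluation harness. The approximate counts also assume that both percentages are taken over all 690 tasks.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome record (not the benchmark's schema)."""
    completed: bool            # the agent finished the task correctly
    harmful_side_effect: bool  # the agent took an unintended harmful action


def summarize(results):
    """Return (completion_rate, harm_rate) as fractions of all tasks."""
    n = len(results)
    if n == 0:
        raise ValueError("no task results to summarize")
    completion_rate = sum(r.completed for r in results) / n
    harm_rate = sum(r.harmful_side_effect for r in results) / n
    return completion_rate, harm_rate


def approx_count(rate, total=690):
    """Rough number of tasks implied by a reported rate over `total` tasks."""
    return round(rate * total)


# Toy data, invented for the example: 3 of 4 tasks done, 1 of 4 harmful.
results = [
    TaskResult(True, False),
    TaskResult(True, False),
    TaskResult(False, True),
    TaskResult(True, False),
]
completion_rate, harm_rate = summarize(results)
print(f"completion={completion_rate:.0%} harm={harm_rate:.0%}")

# Reported rates from the article, converted to approximate task counts.
print("Claude Opus 4.8:", approx_count(0.89), "completed,", approx_count(0.025), "with harm")
print("GPT-4 (2024):", approx_count(0.43), "completed,", approx_count(0.26), "with harm")
```

Under that assumption, the reported rates correspond to roughly 614 completed tasks and 17 tasks with a harmful side effect for Claude Opus 4.8, against roughly 297 and 179 for GPT-4.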

Capability and Safety Align, Not Trade Off

A key finding is that capability and safety go together on WorkBench rather than trade off. The models that finish the most tasks also do the least unintended damage. This contradicts the common assumption that more capable agents necessarily introduce greater risk. According to the paper, models with higher task completion rates consistently show lower harmful side-effect rates.
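One way to check a claim like this is to compute the correlation between completion rate and harm rate across models. A strongly negative value means the models that finish more tasks also do less damage. The sketch below is only an illustration: the first and last rows use the figures reported in the article, while the two middle rows are invented placeholders, and `pearson` is a hand-written helper rather than anything from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


# (model, completion rate, harm rate)
models = [
    ("GPT-4 (2024)", 0.43, 0.26),          # reported in the article
    ("hypothetical_model_a", 0.60, 0.15),  # invented placeholder
    ("hypothetical_model_b", 0.75, 0.08),  # invented placeholder
    ("Claude Opus 4.8", 0.89, 0.025),      # reported in the article
]

completion = [m[1] for m in models]
harm = [m[2] for m in models]
r = pearson(completion, harm)
print(f"correlation between completion and harm rate: {r:.3f}")
```

A negative coefficient here is the same thing as the article's "capability and safety correlate positively", since safety is the inverse of the harm rate.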

Open-Weight Models Close the Cost Gap

The rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models. Frontier costs have stayed relatively stable, while open-weight models now achieve comparable task completion at a fraction of the per-task cost. The paper plots cost per task versus completion, showing an efficient frontier where open-weight models cluster at lower cost points.

Figure 3: Cost per task versus task completion on WorkBench. Cost per task is the total spend to run the benchmark once
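The efficient frontier in a cost-versus-completion plot is the set of models that no other model beats on both axes at once. A minimal sketch of that selection follows. Every model name, cost, and completion value in it is a made-up placeholder, and `efficient_frontier` is an assumed helper, not the paper's code.

```python
def efficient_frontier(points):
    """Return the models not dominated on (lower cost, higher completion).

    `points` is a list of (name, cost_per_task, completion_rate) tuples.
    A model is kept only if it completes more tasks than every model that
    costs the same or less.
    """
    # Cheapest first; at equal cost, the higher completion comes first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_completion = float("-inf")
    for name, cost, completion in ordered:
        if completion > best_completion:
            frontier.append((name, cost, completion))
            best_completion = completion
    return frontier


# Placeholder data: costs and rates are invented for illustration only.
models = [
    ("open_a", 0.02, 0.45),
    ("open_b", 0.05, 0.40),      # dominated: costs more than open_a, completes less
    ("open_c", 0.04, 0.55),
    ("frontier_a", 0.50, 0.89),
    ("frontier_b", 0.60, 0.80),  # dominated by frontier_a
]

for name, cost, completion in efficient_frontier(models):
    print(f"{name}: cost per task {cost:.2f}, completion {completion:.0%}")
```

With these placeholder values the frontier is `open_a`, `open_c`, and `frontier_a`: cheap open-weight models anchor the low-cost end, and a proprietary model remains on the frontier only where it buys extra completion.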

Residual Errors Persist

While several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. The paper notes that these errors are rare but consequential, suggesting that agent safety remains an unsolved problem at the tail end of the distribution.

Figure 2: Task completion on WorkBench by release date. Successful task completion for every evaluated model against its

What to watch

Watch for the next WorkBench update when GPT-5.5 or Gemini Ultra 3 scores are published. The key metric will be whether the harm rate can drop below 1% while maintaining >90% task completion, and whether open-weight models can break the 80% completion barrier.


AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The WorkBench Revisited paper provides the strongest empirical evidence to date that AI agent capability and safety are not in tension. The correlation between high task completion and low harm rates suggests that the same architectural improvements that boost task performance (better reasoning, more robust context handling, improved instruction following) also reduce unintended side effects. This is a direct counterargument to the 'safety tax' narrative that has dominated policy discussions.

However, the residual errors are instructive. The fact that frontier models still occasionally email the wrong person indicates that current architectures lack a robust 'undo' mechanism or reliable pre-action verification. This is not a scaling problem; it is an architectural gap.

The paper's cost analysis is also significant: open-weight models have collapsed the cost of achieving 2024-level agent performance, which will accelerate enterprise adoption of agentic workflows but also increase the attack surface for misuse.

The paper's methodology is sound, but the benchmark's 690 tasks may not capture the full distribution of real-world workplace failures. The harm rate metric, while valuable, does not distinguish between reversible and irreversible harm, a distinction that matters for deployment decisions.