WorkBench is a benchmark for workplace AI agents that measures task completion and unintended harmful actions across 690 tasks.

How much did Claude Opus 4.8 improve over GPT-4?

Task completion more than doubled, from 43% to 89%, while the rate of harmful side effects fell from 26% to 2.5%.

WorkBench Revisited: Claude Opus 4.8 Hits 89% Task Completion

Claude Opus 4.8 completes 89% of WorkBench tasks with a 2.5% harm rate, compared with GPT-4's 43% completion and 26% harm rate in 2024, suggesting that capability and safety align.

Ala SMITH & AI Research Desk · Jun 15, 2026 · 3 min read · AI-Generated

Source: arxiv.org via arxiv_ai, reddit_claude, devto_claudecode, reddit_claudecode (multi-source)

How does Claude Opus 4.8 perform on the WorkBench benchmark compared to GPT-4 in 2024?

Claude Opus 4.8 completes 89% of WorkBench tasks with a 2.5% harmful side-effect rate, compared with GPT-4's 43% completion and 26% harmful side-effect rate in 2024, per the updated benchmark released June 10, 2026.

TL;DR

Claude Opus 4.8 completes 89% of WorkBench tasks. · Unintended harmful actions dropped from 26% to 2.5%. · Open-weight models now match 2024 frontier performance at lower cost.

Claude Opus 4.8 completes 89% of WorkBench tasks and causes unintended harm on just 2.5% of them, per the updated benchmark released June 10, 2026. That is a dramatic improvement from GPT-4's 43% completion and 26% harm rate in March 2024.

Key facts

Claude Opus 4.8: 89% task completion, 2.5% harm rate.
GPT-4 (2024): 43% task completion, 26% harm rate.
Open-weight models match 2024 frontier performance at lower cost.
Capability and safety correlate positively on WorkBench.
Residual errors cause irreversible harm in rare cases.

The WorkBench Revisited paper by Olly Styles evaluates frontier and open-weight agents across 690 workplace tasks, measuring both task completion and harmful side effects. The best agent in 2024, GPT-4, completed 43% of tasks and took an unintended harmful action—such as emailing the wrong person—on 26% of them. By June 2026, Claude Opus 4.8 completes 89% and takes an unintended harmful action on 2.5%.
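The two headline metrics can be illustrated with a minimal Python sketch. It assumes each task yields two independent flags, one for completion and one for an unintended harmful action, and that both rates are taken over all tasks. The `TaskOutcome` record and its field names are hypothetical, and the data is a toy example, not benchmark output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """Hypothetical per-task record; the field names are assumptions."""
    completed: bool       # the agent reached the intended end state
    harmful_action: bool  # the agent took an unintended harmful action


def summarize(outcomes: list[TaskOutcome]) -> tuple[float, float]:
    """Return (completion rate, harm rate) as fractions of all tasks."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    total = len(outcomes)
    completed = sum(o.completed for o in outcomes)
    harmful = sum(o.harmful_action for o in outcomes)
    return completed / total, harmful / total


# Toy data, not benchmark results: 4 tasks, 3 completed, 1 harmful side effect.
toy = [
    TaskOutcome(True, False),
    TaskOutcome(True, False),
    TaskOutcome(True, False),
    TaskOutcome(False, True),
]
completion_rate, harm_rate = summarize(toy)
print(f"completion={completion_rate:.0%} harm={harm_rate:.1%}")
```

Keeping the two flags separate matters because a task can be completed and still involve a harmful action along the way, so neither rate can be derived from the other.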

Capability and Safety Align, Not Trade Off

A key finding is that capability and safety go together on WorkBench rather than trade off. The models that finish the most tasks also do the least unintended damage. This contradicts the common assumption that more capable agents necessarily introduce greater risk. According to the paper, models with higher task completion rates consistently show lower harmful side-effect rates.

Open-Weight Models Close the Cost Gap

The rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models. Frontier costs have stayed relatively stable, while open-weight models now achieve comparable task completion at a fraction of the per-task cost. The paper plots cost per task versus completion, showing an efficient frontier where open-weight models cluster at lower cost points.

Figure 3: Cost per task versus task completion on WorkBench. Cost per task is the total spend to run the benchmark once
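One way to read a cost-versus-completion plot is to extract its efficient frontier: the models for which no other model is both cheaper and at least as capable. The sketch below is an illustration of that idea, not the paper's method. The model names, costs, and completion values are made-up placeholders.

```python
def efficient_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (name, cost_per_task, completion). A point is dominated
    when another point costs no more and completes at least as much, with
    at least one of the two strictly better.
    """
    # Sort by cost ascending; on equal cost, put higher completion first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_completion = float("-inf")
    for name, cost, completion in ordered:
        # Keep a point only if it beats every cheaper point on completion.
        if completion > best_completion:
            frontier.append((name, cost, completion))
            best_completion = completion
    return frontier


# Placeholder values for illustration only; these are not WorkBench results.
toy = [
    ("model_a", 0.10, 0.40),
    ("model_b", 0.50, 0.45),
    ("model_c", 0.30, 0.35),  # dominated by model_a: pricier and less capable
    ("model_d", 2.00, 0.90),
]
frontier_names = [name for name, _, _ in efficient_frontier(toy)]
print(frontier_names)
```

Under this definition, a cheap model enters the frontier as soon as it matches an older, more expensive model's completion rate, which is the pattern the article describes for open-weight models.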

Residual Errors Persist

While several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. The paper notes that these errors are rare but consequential, suggesting that agent safety remains an unsolved problem at the tail end of the distribution.

Figure 2: Task completion on WorkBench by release date. Successful task completion for every evaluated model against its

What to watch

Watch for the next WorkBench update when GPT-5.5 or Gemini Ultra 3 scores are published. The key metric will be whether the harm rate can drop below 1% while maintaining >90% task completion, and whether open-weight models can break the 80% completion barrier.

Source: arxiv.org

[Updated 18 Jun via devto_claudecode]

A separate infrastructure bug in Claude Code v2.1.154 (May 28, 2026) caused silent thinking block corruption during multi-turn sessions with Opus 4.8, producing unrecoverable HTTP 400 errors on subsequent turns. The hotfix v2.1.156, released May 29 at 01:42 UTC, patches the mutation path by enforcing frozen dataclass replay of signed thinking blocks [per dev.to]. Only Opus 4.8 with extended thinking active was affected; Opus 4.7 and other model variants were not impacted.
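The general idea behind a frozen-dataclass fix can be sketched in Python. This is not Claude Code's actual implementation: the `ThinkingBlock` class, its field names, and the `replay` helper are assumptions made for illustration. The point is only that a frozen record rejects in-place edits, so a signed block is sent back exactly as it was received.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ThinkingBlock:
    """Hypothetical immutable record for a signed thinking block."""
    thinking: str
    signature: str


def replay(history: tuple[ThinkingBlock, ...]) -> list[dict]:
    """Serialize stored blocks unchanged for the next turn (illustrative)."""
    return [
        {"type": "thinking", "thinking": b.thinking, "signature": b.signature}
        for b in history
    ]


block = ThinkingBlock(thinking="step 1", signature="sig-abc")

# Any attempt to edit the block in place fails loudly instead of silently
# producing content that no longer matches its signature.
try:
    block.thinking = "edited"
    mutated = True
except FrozenInstanceError:
    mutated = False

payload = replay((block,))
print(mutated, payload[0]["signature"])
```

A failure at the point of mutation is easier to diagnose than a rejected request several turns later, which is the failure mode the update describes.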

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The WorkBench Revisited paper provides the strongest empirical evidence to date that AI agent capability and safety are not in tension. The correlation between high task completion and low harm rates suggests that the same architectural improvements that boost task performance (better reasoning, more robust context handling, improved instruction following) also reduce unintended side effects. This is a direct counterargument to the 'safety tax' narrative that has dominated policy discussions.

However, the residual errors are instructive. The fact that frontier models still occasionally email the wrong person indicates that current architectures lack a robust 'undo' mechanism or reliable pre-action verification. This is not a scaling problem; it is an architectural gap.

The paper's cost analysis is also significant: open-weight models have collapsed the cost of achieving 2024-level agent performance, which will accelerate enterprise adoption of agentic workflows but also increase the attack surface for misuse. The paper's methodology is sound, but the benchmark's 690 tasks may not capture the full distribution of real-world workplace failures. The harm rate metric, while valuable, does not distinguish between reversible and irreversible harm, a distinction that matters for deployment decisions.
