Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

A cybersecurity dashboard shows CMU ExploitBench scores with Claude Mythos at 9.9 and GPT-5.5 at 5.5, alongside V8…

AI ResearchBreakthroughScore: 88

CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5

CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more. The cost-performance tradeoff is the real story.

AAAla SMITH & AI Research Desk·5h ago·4 min read··10 views·AI-Generated·Report error

Source: the-decoder.comvia the_decoderSingle Source

How did Claude Mythos and GPT-5.5 perform on CMU's new browser exploit benchmark?

Claude Mythos scored 9.9/16 on CMU's ExploitBench, reaching full code execution on 21 of 41 V8 vulnerabilities. GPT-5.5 scored 5.5, reaching top tier on just 2. Mythos cost $36,428 per run, 12x GPT-5.5's $3,075.

TL;DR

CMU ExploitBench scores AI agents on V8 exploitation · Claude Mythos hits 9.9/16, GPT-5.5 scores 5.5 · Mythos costs $36,428 per full run, 12x GPT-5.5

Carnegie Mellon University's ExploitBench reveals Claude Mythos scores 9.9/16 on real V8 exploits, while GPT-5.5 trails at 5.5. Mythos costs $36,428 per full run — 12x GPT-5.5's $3,075 — raising questions about cost-efficiency for autonomous vulnerability exploitation.

Key facts

Mythos scored 9.9/16 on ExploitBench, GPT-5.5 scored 5.5
Mythos reached full code execution on 21 of 41 V8 vulnerabilities
Full Mythos run cost $36,428 across 122 episodes
GPT-5.5 run cost $3,075 across 123 episodes, 12x cheaper
Autonomous mode: Mythos 9.55, GPT-5.5 via Codex 4.30

Key Takeaways

CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more.
The cost-performance tradeoff is the real story.

The Benchmark: Five Tiers of Real V8 Exploitation

Researchers at Carnegie Mellon University built ExploitBench, a benchmark that scores AI agents across five tiers of real-world exploitation against Google's V8 JavaScript engine — the core of Chrome, Edge, Node.js, and Cloudflare Workers. Unlike prior tests that only check for bug triggers, ExploitBench evaluates progress all the way up to arbitrary code execution on the target system. [According to The Decoder]

Mythos Dominates, But at a Steep Price

Anthropic's Claude Mythos Preview, with occasional human hints, hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two. The gap widens in fully autonomous mode: Mythos scored 9.55 points (barely any drop), while GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution. [According to the source]

Image description

The cost disparity is stark: the full Mythos test run across 122 episodes cost about $36,428, according to ExploitBench. GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about twelve times cheaper. The UK's AI Safety Institute also confirmed that Mythos performs somewhat better than GPT-5.5 but at a much higher cost in a recent test. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem.

Unique Take: The Cost-Performance Tradeoff Is the Real Story

While the headline is that Mythos outperforms GPT-5.5, the more interesting structural observation is the 12x cost multiplier for about 2x the benchmark score. This mirrors a pattern seen across the past 90 days in AI agent benchmarks: Anthropic models tend to be more sample-efficient on complex multi-step tasks, but OpenAI's architecture may have more headroom for scaling. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem — a bet the company has historically been willing to make.

Human Expert Review Validates Results

ExploitBench co-author Seunghyun Lee — an experienced security researcher with over 20 reported browser vulnerabilities — reviewed the Mythos transcripts one by one. His takeaway: the model works like a 'fairly competent browser / JS engine security researcher.' In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee.

ExploitBench leaderboard: Anthropic's Claude Mythos Preview leads OpenAI's GPT-5.5 by a wide margin. Only these two models reach the highest tier, T1,

The researchers acknowledge that the tested bugs are publicly known, and models could theoretically draw on training data. But the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark doesn't yet measure the ability to find zero-day vulnerabilities — only the ability to exploit known ones.

What to watch

Watch for OpenAI's next model release — likely within 3-6 months — to see if it closes the gap on ExploitBench with more inference compute. Also track whether Anthropic can reduce Mythos inference costs by 10x without losing performance, which would make autonomous exploit development economically viable for security teams.

Sources cited in this article

The Decoder
ExploitBench. GPT-5.5
Lee.

Source: gentic.news · 5h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 3 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

The ExploitBench results reveal a clear structural tradeoff: Anthropic's Mythos is more sample-efficient per dollar, but OpenAI's GPT-5.5 has a 12x cost advantage that may be more valuable in production. For security teams, the question isn't which model is smarter — it's which model can reliably exploit vulnerabilities at a price point that makes sense for continuous red-teaming. The human expert validation is notable: Lee's review confirms that Mythos is not just pattern-matching from training data but developing novel exploit techniques. However, the benchmark's limitation to known CVEs means we still don't know how these models would perform on zero-day discovery — the true test of autonomous vulnerability research. Comparing to prior benchmarks, this mirrors the pattern seen in SWE-Bench where Anthropic models often lead on complex multi-step tasks but at higher cost. The UK AISI replication adds credibility but also confirms the cost gap. The key insight: if OpenAI can close the performance gap with more compute, the cost advantage flips the narrative entirely.

#ai security #autonomous agents #benchmarks #language models

Compare side-by-side

Claude Mythos vs GPT-3.5

→

Mentioned in this article

Claude Mythos GPT-3.5 ExploitBench Carnegie Mellon University Codex API

Enjoyed this article?

Get the weekly AI intelligence briefing

✨AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

AI Research

MCP vs CLI Debate Resolved by Anthropic's Code Mode: 98.7% Token Drop

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

Original research · EUMAS 2026

MNEMA — A Witness Lattice for Multi-Agent AI Memory

Cryptographic memory units · 1−α detection floor · 15 pp PDF

Field framework · v1.0

Epistemic Infrastructure

12 pillars · 11-stage knowledge metabolism · pathology catalog

CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5

Key Takeaways

The Benchmark: Five Tiers of Real V8 Exploitation

Mythos Dominates, But at a Steep Price

Unique Take: The Cost-Performance Tradeoff Is the Real Story

Human Expert Review Validates Results

What to watch

Sources cited in this article

AI Analysis

✨AI Toolslive

Related Articles

Claude Mythos Clears All UK Cyberattack Simulators, Doubling Speed Revised

Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks

RRCM Uses GRPO to Decide When to Retrieve for LLM Recommendation

AMD ROCm Performance Jumps 75x in 14 Days Post-DeepSeek v4

Claude Code's Six-Layer Architecture: Harness, Not Magic

MCP vs CLI Debate Resolved by Anthropic's Code Mode: 98.7% Token Drop

The framework underneath this story

More in AI Research

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld