Carnegie Mellon University's ExploitBench reveals Claude Mythos scores 9.9/16 on real V8 exploits, while GPT-5.5 trails at 5.5. Mythos costs $36,428 per full run — 12x GPT-5.5's $3,075 — raising questions about cost-efficiency for autonomous vulnerability exploitation.
Key facts
- Mythos scored 9.9/16 on ExploitBench, GPT-5.5 scored 5.5
- Mythos reached full code execution on 21 of 41 V8 vulnerabilities
- Full Mythos run cost $36,428 across 122 episodes
- GPT-5.5 run cost $3,075 across 123 episodes, 12x cheaper
- Autonomous mode: Mythos 9.55, GPT-5.5 via Codex 4.30
Key Takeaways
- CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more.
- The cost-performance tradeoff is the real story: roughly 12x the cost for about 2x the benchmark score.
The Benchmark: Five Tiers of Real V8 Exploitation
Researchers at Carnegie Mellon University built ExploitBench, a benchmark that scores AI agents across five tiers of real-world exploitation against Google's V8 JavaScript engine — the core of Chrome, Edge, Node.js, and Cloudflare Workers. Unlike prior tests that only check for bug triggers, ExploitBench evaluates progress all the way up to arbitrary code execution on the target system. [According to The Decoder]
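The article does not spell out how the five tiers map onto the 16-point scale. As a purely illustrative sketch (the tier names and the uniform per-tier weighting below are assumptions, not ExploitBench's published rubric), a tiered scoring function might look like this:

```python
# Illustrative sketch only: tier names and point weights are assumptions,
# NOT the published ExploitBench rubric.
from enum import IntEnum

class Tier(IntEnum):
    """Hypothetical exploitation ladder for a single V8 vulnerability."""
    NO_PROGRESS = 0
    BUG_TRIGGERED = 1          # e.g. crash or sanitizer report
    CORRUPTION_PRIMITIVE = 2   # usable memory-corruption primitive
    ARBITRARY_READ_WRITE = 3   # arbitrary read/write in the target process
    CODE_EXECUTION = 4         # the top tier reported in the article

def episode_score(highest_tier: Tier, points_per_tier: float = 4.0) -> float:
    """Score one episode by the highest tier reached (max 16 under this weighting)."""
    return float(highest_tier) * points_per_tier

# Averaging episode scores across all tested vulnerabilities would give a
# figure comparable to the reported 9.90 (Mythos) and 5.51 (GPT-5.5) averages.
```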
Mythos Dominates, But at a Steep Price
Anthropic's Claude Mythos Preview, with occasional human hints, hit an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities. OpenAI's GPT-5.5 trailed far behind at 5.51 points, reaching the top tier on just two. The gap widens in fully autonomous mode: Mythos scored 9.55 points (barely any drop), while GPT-5.5 via Codex managed only 4.30. None of the other tested models achieved full code execution. [According to the source]

The cost disparity is stark: the full Mythos test run cost about $36,428 across 122 episodes, according to ExploitBench, while GPT-5.5 via Codex ran 123 episodes for roughly $3,075, about twelve times less. In a recent test, the UK's AI Safety Institute likewise found that Mythos performs somewhat better than GPT-5.5, but at a much higher cost.
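For a sense of scale, here is a back-of-the-envelope calculation from the reported totals, assuming cost simply divides evenly across episodes:

```python
# Per-episode cost derived from the run totals reported by ExploitBench.
mythos_total, mythos_episodes = 36_428, 122
gpt_total, gpt_episodes = 3_075, 123

mythos_per_episode = mythos_total / mythos_episodes  # ~$299 per episode
gpt_per_episode = gpt_total / gpt_episodes           # ~$25 per episode

print(f"Mythos:  ${mythos_per_episode:,.0f} per episode")
print(f"GPT-5.5: ${gpt_per_episode:,.0f} per episode")
print(f"Cost multiple: {mythos_total / gpt_total:.1f}x")  # ~11.8x, i.e. roughly 12x
```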
Unique Take: The Cost-Performance Tradeoff Is the Real Story
While the headline is that Mythos outperforms GPT-5.5, the more interesting structural observation is the 12x cost multiplier for about 2x the benchmark score. This mirrors a pattern seen across the past 90 days in AI agent benchmarks: Anthropic models tend to be more sample-efficient on complex multi-step tasks, but OpenAI's architecture may have more headroom for scaling. The price gap suggests OpenAI could close the performance gap by throwing more compute at the problem — a bet the company has historically been willing to make.
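Framed as cost per average benchmark point, again derived only from the reported run costs and scores, the tradeoff looks like this:

```python
# Cost per average benchmark point, from the reported run costs and scores.
mythos_cost, mythos_score = 36_428, 9.90
gpt_cost, gpt_score = 3_075, 5.51

print(f"Mythos:  ${mythos_cost / mythos_score:,.0f} per point")  # ~$3,680
print(f"GPT-5.5: ${gpt_cost / gpt_score:,.0f} per point")        # ~$558
print(f"Score ratio: {mythos_score / gpt_score:.1f}x")           # ~1.8x for ~11.8x the cost
```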
Human Expert Review Validates Results
ExploitBench co-author Seunghyun Lee — an experienced security researcher with over 20 reported browser vulnerabilities — reviewed the Mythos transcripts one by one. His takeaway: the model works like a 'fairly competent browser / JS engine security researcher.' In one case, Mythos developed an exploit technique that Lee and a colleague had previously dismissed as too complex. In another, it reproduced a vulnerability (CVE-2024-0519) that human researchers had failed to crack for over a year, according to Lee.

The researchers acknowledge that the tested bugs are publicly known, and models could theoretically draw on training data. But the dataset also includes vulnerabilities with no public exploit or bug report. The benchmark doesn't yet measure the ability to find zero-day vulnerabilities — only the ability to exploit known ones.
What to watch
Watch for OpenAI's next model release — likely within 3-6 months — to see if it closes the gap on ExploitBench with more inference compute. Also track whether Anthropic can reduce Mythos inference costs by 10x without losing performance, which would make autonomous exploit development economically viable for security teams.