
GPT-5.5 Ties Claude Mythos in Enterprise Cyber Attack Tests, AISI Finds

UK AISI finds GPT-5.5 matches Claude Mythos on full enterprise network attack simulation, scoring 71.4% on expert tasks vs 68.6%.

16h ago · 3 min read · 283 views · AI-generated
Source: the-decoder.com via the_decoder (corroborated)

TL;DR

GPT-5.5 matches Claude Mythos on full network attack simulation. · Scored 71.4% on expert CTF tasks vs Mythos's 68.6%. · AISI warns cyber capabilities are emerging from general AI gains.

UK AISI found GPT-5.5 matches Claude Mythos Preview in autonomously solving a full enterprise network attack simulation. OpenAI's model scored 71.4% on expert-level capture-the-flag tasks, edging out Anthropic's 68.6%.

Key facts

  • GPT-5.5 scored 71.4% on expert CTF tasks vs Mythos 68.6%.
  • Only the second model to fully solve the TLO enterprise network simulation.
  • GPT-5.5 succeeded in 2 of 10 TLO attempts; Mythos in 3 of 10.
  • GPT-5.4 scored 52.4%; Claude Opus 4.7 scored 48.6%.
  • AISI estimates human expert needs ~20 hours for same simulation.

Full Network Attack: GPT-5.5 Matches Mythos

The UK AI Security Institute (AISI) tested OpenAI's GPT-5.5 on a battery of cyberattack evaluations and found it to be the second model, after Anthropic's Claude Mythos Preview, to fully complete a multi-stage enterprise attack simulation [According to AISI's published results]. On "The Last Ones" (TLO) simulation, a 32-step network traversal across four subnets and 20 hosts, GPT-5.5 succeeded in 2 out of 10 attempts, while Claude Mythos Preview succeeded in 3 out of 10. AISI estimates a human expert would need about 20 hours for the same task.
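With only 10 attempts per model, a 2-versus-3 split carries wide uncertainty. The sketch below is not part of AISI's analysis; it applies a standard Wilson score interval to the reported counts, and it assumes the attempts can be treated as independent trials. The function and variable names are illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# TLO success counts as reported in the article: (successes, attempts).
tlo_runs = {"GPT-5.5": (2, 10), "Claude Mythos Preview": (3, 10)}

# Assumption: each attempt is an independent trial with a fixed success rate.
intervals = {name: wilson_interval(s, n) for name, (s, n) in tlo_runs.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {low:.2f} to {high:.2f}")

gpt_low, gpt_high = intervals["GPT-5.5"]
mythos_low, mythos_high = intervals["Claude Mythos Preview"]

# Two closed intervals overlap when each starts before the other ends.
intervals_overlap = gpt_low <= mythos_high and mythos_low <= gpt_high
print("Intervals overlap:", intervals_overlap)
```

Under these assumptions the two intervals overlap heavily, which is consistent with reading the TLO results as a match rather than a lead for either model.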

Expert Task Scores and Broader Trend

On AISI's 95-task capture-the-flag suite, GPT-5.5 achieved 71.4% at the Expert difficulty level, versus 68.6% for Claude Mythos Preview, a gap within the statistical margin of error. For context, GPT-5.4 scored 52.4% and Claude Opus 4.7 scored 48.6%. AISI interprets these results as evidence that cyberattack capabilities are emerging as a by-product of general AI advances in autonomy, reasoning, and coding, rather than from training aimed specifically at cyber offense [Per AISI's analysis].
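To put the four reported scores side by side, the short sketch below computes the percentage-point gaps. The scores come from the article; pairing each lab's two listed models is an assumption made for illustration, since the article does not say that one is the direct predecessor of the other.

```python
# Expert-level CTF scores (percent) as reported in the article.
scores = {
    "GPT-5.5": 71.4,
    "Claude Mythos Preview": 68.6,
    "GPT-5.4": 52.4,
    "Claude Opus 4.7": 48.6,
}


def gap(a: str, b: str) -> float:
    """Percentage-point difference, rounded to one decimal to hide float noise."""
    return round(scores[a] - scores[b], 1)


# Gap between the two leading models.
cross_lab_gap = gap("GPT-5.5", "Claude Mythos Preview")

# Assumed pairings within each lab (illustrative, not stated in the article).
openai_gain = gap("GPT-5.5", "GPT-5.4")
anthropic_gain = gap("Claude Mythos Preview", "Claude Opus 4.7")

print(f"Gap between leading models: {cross_lab_gap} points")
print(f"GPT-5.5 over GPT-5.4: {openai_gain} points")
print(f"Claude Mythos Preview over Claude Opus 4.7: {anthropic_gain} points")
```

The gap between the two leading models is small next to the gap separating each from the other model listed for its lab.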


Unique Take: Capability Convergence, Not Arms Race

The AP wire would frame this as a competitive escalation between OpenAI and Anthropic. The more structural observation is that both models now sit at nearly identical cyber capability levels, which suggests a ceiling imposed by current architectures rather than a divergence between the labs. If GPT-5.5 and Claude Mythos converge within statistical noise on both isolated tasks and full simulations, the next meaningful gain likely requires a fundamentally different training paradigm, not more compute on the same recipe. AISI's finding that performance scales with inference compute further implies the bottleneck is inference-time reasoning, not model weights.

alt: Line chart showing average completed steps in the 32-step network simulation

What to watch

Watch for AISI's next evaluation cycle, expected Q3 2026, which may include models from Google DeepMind and Mistral. Also monitor whether OpenAI or Anthropic publishes ablation studies isolating which training improvements drove the cyber capability jump—neither has done so.

alt: Scatter plot showing average success rate on advanced cyber capture-the-flag tasks across 10 AI models from August 2025 to May 2026, with GPT-5.5


Sources cited in this article

  1. AISI's

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala AYADI.


AI Analysis

The convergence of GPT-5.5 and Claude Mythos on cyberattack benchmarks suggests a plateau in current model architectures for autonomous hacking tasks, not an arms race. Both models achieve similar results within statistical error on isolated tasks and full simulations, indicating that the marginal gains from scale or data may be diminishing for this capability. AISI's observation that performance scales with inference compute points to a reasoning bottleneck: more compute at inference time, not larger models, may be the lever for future gains. This aligns with broader industry trends favoring inference-time compute scaling (e.g., o1-style chain-of-thought). The key unknown is whether these capabilities transfer to real-world networks with active defenses—the TLO simulation had none, limiting direct applicability.