What is the TLO simulation?

The Last Ones (TLO) is a 32-step enterprise network attack simulation spanning 4 subnets and 20 hosts that requires credential theft and lateral movement.

How does GPT-5.5 compare to prior models?

GPT-5.5 scored 71.4% on expert tasks, ahead of GPT-5.4 (52.4%) and Claude Opus 4.7 (48.6%), and is the first non-Anthropic model to fully solve TLO.

Products & Launches · Breakthrough · Score: 100

GPT-5.5 Ties Claude Mythos in Enterprise Cyber Attack Tests, AISI Finds

UK AISI finds GPT-5.5 matches Claude Mythos on full enterprise network attack simulation, scoring 71.4% on expert tasks vs 68.6%.

Ala SMITH & AI Research Desk · May 1, 2026 · 3 min read · AI-Generated

Source: the-decoder.com via the_decoder, medium_claude · Multi-Source

How does GPT-5.5 compare to Claude Mythos in cyber attack tests by the UK AI Security Institute?

UK AISI found GPT-5.5 matches Claude Mythos Preview in autonomously solving a full enterprise network attack simulation, scoring 71.4% on expert CTF tasks vs 68.6%.

TL;DR

GPT-5.5 matches Claude Mythos on full network attack simulation. · Scored 71.4% on expert CTF tasks vs Mythos's 68.6%. · AISI warns cyber capabilities are emerging from general AI gains.

UK AISI found GPT-5.5 matches Claude Mythos Preview in autonomously solving a full enterprise network attack simulation. OpenAI's model scored 71.4% on expert-level capture-the-flag tasks, edging out Anthropic's 68.6%.

Key facts

GPT-5.5 scored 71.4% on expert CTF tasks vs Mythos 68.6%.
Second model, after Claude Mythos Preview, to fully solve the TLO enterprise network simulation.
GPT-5.5 succeeded in 2 of 10 TLO attempts; Mythos in 3 of 10.
GPT-5.4 scored 52.4%; Claude Opus 4.7 scored 48.6%.
AISI estimates human expert needs ~20 hours for same simulation.

Full Network Attack: GPT-5.5 Matches Mythos

The UK AI Security Institute (AISI) tested OpenAI's GPT-5.5 against a battery of cyberattack evaluations and found it to be the second model, after Anthropic's Claude Mythos Preview, to fully complete a multi-stage enterprise attack simulation [According to AISI's published results]. On "The Last Ones" (TLO), a 32-step network traversal across four subnets and 20 hosts, GPT-5.5 succeeded in 2 out of 10 attempts, while Claude Mythos Preview succeeded in 3 out of 10. AISI estimates a human expert would need about 20 hours for the same task.
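To get a feel for how much uncertainty sits behind success counts of 2 in 10 and 3 in 10, here is a minimal sketch that computes a Wilson score interval for each. The Wilson interval is a choice made for this illustration, not a method attributed to AISI, and the sketch assumes the ten attempts are independent. The function name `wilson_interval` and the `runs` dictionary are hypothetical helpers.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Success counts as reported in the article: 2 of 10 and 3 of 10 TLO attempts.
runs = {"GPT-5.5": (2, 10), "Claude Mythos Preview": (3, 10)}
intervals = {name: wilson_interval(s, n) for name, (s, n) in runs.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: {lo:.1%} to {hi:.1%}")

# Two closed intervals overlap when each one starts before the other ends.
gpt_lo, gpt_hi = intervals["GPT-5.5"]
myt_lo, myt_hi = intervals["Claude Mythos Preview"]
overlap = gpt_lo <= myt_hi and myt_lo <= gpt_hi
print("Intervals overlap:", overlap)
```

Under these assumptions the two intervals come out at roughly 6% to 51% and 11% to 60%, so they overlap heavily. With only ten attempts per model, a one-run difference says little about which model is stronger on TLO.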

Expert Task Scores and Broader Trend

On AISI's 95-task capture-the-flag suite, GPT-5.5 achieved 71.4% at the Expert difficulty, versus 68.6% for Claude Mythos Preview, a gap within the statistical margin of error. For context, GPT-5.4 scored 52.4% and Claude Opus 4.7 scored 48.6%. AISI interprets these results as evidence that cyberattack capabilities are emerging as a by-product of general AI advances in autonomy, reasoning, and coding, rather than as the result of explicit training [Per AISI's analysis].

Unique Take: Capability Convergence, Not Arms Race

The AP wire would frame this as a competitive escalation between OpenAI and Anthropic. The more structural observation: both models now sit at nearly identical cyber capability levels, suggesting a ceiling imposed by current architectures—not a divergence. If GPT-5.5 and Claude Mythos converge within statistical noise on both isolated tasks and full simulations, the next delta likely requires a fundamentally different training paradigm, not more compute on the same recipe. AISI's finding that performance scales with inference compute further implies the bottleneck is inference-time reasoning, not model weights.

Figure: Line chart showing average completed steps in the 32-step network simulation

What to watch

Watch for AISI's next evaluation cycle, expected Q3 2026, which may include models from Google DeepMind and Mistral. Also monitor whether OpenAI or Anthropic publishes ablation studies isolating which training improvements drove the cyber capability jump—neither has done so.

Figure: Scatter plot showing average success rate on advanced cyber capture-the-flag tasks across 10 AI models from August 2025 to May 2026, with GPT-5.5

[Updated 02 May via the_decoder]

Crucially, GPT-5.5 is already shipping in ChatGPT and through the API, while Claude Mythos Preview remains limited to a small group [per The Decoder]. This means OpenAI's model poses a more immediate real-world risk, as its cyber capabilities are already broadly accessible.

Sources cited in this article

UK AI Security Institute (AISI)
The Decoder

Source: gentic.news · May 1, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The convergence of GPT-5.5 and Claude Mythos on cyberattack benchmarks suggests a plateau in current model architectures for autonomous hacking tasks, not an arms race. Both models achieve similar results within statistical error on isolated tasks and full simulations, indicating that the marginal gains from scale or data may be diminishing for this capability. AISI's observation that performance scales with inference compute points to a reasoning bottleneck: more compute at inference time, not larger models, may be the lever for future gains. This aligns with broader industry trends favoring inference-time compute scaling (e.g., o1-style chain-of-thought). The key unknown is whether these capabilities transfer to real-world networks with active defenses—the TLO simulation had none, limiting direct applicability.
