What benchmarks does GPT-5.5-Cyber outperform Mythos on?

GPT-5.5-Cyber leads on CyberGym, ExploitGym, and SEC-bench Pro, according to OpenAI.

How does the updated Codex Security plugin work?

It scans code, analyzes threat models, checks reachability, builds patches, and verifies results — with human sign-off on every change.

Two bar charts comparing OpenAI GPT-5.5-Cyber and Anthropic Mythos scores on CyberGym, ExploitGym, and SEC-bench…

OpenAI GPT-5.5-Cyber Beats Anthropic Mythos on Security Benchmarks

OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on security benchmarks. Updated Codex plugin auto-patches after scanning 30M commits.

Ala SMITH & AI Research Desk · 9h ago · 3 min read · 50 views · AI-Generated

Source: the-decoder.com via the_decoder, wired_ai, engadget, openai_blog · Widely Reported

Does OpenAI's GPT-5.5-Cyber outperform Anthropic's Mythos on cybersecurity benchmarks?

OpenAI's GPT-5.5-Cyber outperforms Anthropic's Mythos on CyberGym, ExploitGym, and SEC-bench Pro benchmarks. The updated Codex Security plugin auto-patches vulnerabilities after scanning 30M commits across 30K codebases.

TL;DR

GPT-5.5-Cyber leads on CyberGym, ExploitGym, SEC-bench Pro. · Codex Security plugin now auto-generates patches after scanning 30M commits. · OpenAI partners with 25+ security firms and governments for Daybreak.

OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on CyberGym, ExploitGym, and SEC-bench Pro. The updated model and Codex Security plugin now auto-patch vulnerabilities after scanning 30M commits.

Key facts

GPT-5.5-Cyber beats Anthropic's Mythos on CyberGym, ExploitGym, SEC-bench Pro.
Codex Security plugin scanned 30M+ commits across 30K+ codebases.
500K+ findings auto-flagged as fixed; 70K manually confirmed.
OpenAI partners with 25+ security firms and several governments.
Patch the Planet initiative targets open-source software bugs.

OpenAI is expanding its Daybreak cybersecurity initiative with an updated Codex Security plugin, the full GPT-5.5-Cyber model, and a partner network of more than 25 security firms and several governments. The focus shifts from finding vulnerabilities to patching them automatically, according to The Decoder.

Key Takeaways

OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on security benchmarks.
Updated Codex plugin auto-patches after scanning 30M commits.

Codex Security update closes the loop from discovery to patch

The Codex Security plugin shipped as a research preview in March. Since then, it has scanned over 30 million commits across more than 30,000 codebases, OpenAI says. Over 500,000 findings were automatically flagged as fixed, and human reviewers manually confirmed another 70,000. The updated plugin analyzes code alongside a threat model, spots flaws, checks whether affected code is reachable, builds a targeted patch, and verifies the result. New features include deep scans of entire codebases, attack path analysis, and export to vulnerability management systems via SARIF files or CodeQL queries. Humans still sign off on every change, according to the OpenAI blog.
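To make the SARIF export step concrete, here is a minimal sketch of how reachable findings could be serialized into a SARIF 2.1.0 log. It is not OpenAI's implementation: the Finding class, its field names, the tool name "example-scanner", and the humanApproved property are all hypothetical, chosen only to mirror the reachability check and human sign-off described above. The SARIF field names themselves (runs, tool.driver.name, results, ruleId, message.text, locations) follow the public SARIF 2.1.0 format.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    # Hypothetical data model; the article does not describe the plugin's internals.
    rule_id: str
    message: str
    file_uri: str
    start_line: int
    reachable: bool               # outcome of a reachability check
    human_approved: bool = False  # whether a reviewer signed off on the patch


def to_sarif(findings, tool_name="example-scanner"):
    """Return a minimal SARIF 2.1.0 log (as a dict) for reachable findings only."""
    results = []
    for f in findings:
        if not f.reachable:
            continue  # skip flaws in code that cannot be reached
        results.append({
            "ruleId": f.rule_id,
            "level": "warning",
            "message": {"text": f.message},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f.file_uri},
                    "region": {"startLine": f.start_line},
                }
            }],
            # Property bags are SARIF's slot for tool-specific data.
            "properties": {"humanApproved": f.human_approved},
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Example: one reachable finding and one unreachable finding.
example = [
    Finding("EX001", "Possible SQL injection", "src/db.py", 42, True, True),
    Finding("EX002", "Unchecked buffer length", "src/legacy.py", 7, False),
]
print(json.dumps(to_sarif(example), indent=2))
```

A log in this shape can be written to a .sarif file and loaded by tools that accept SARIF, which is the kind of hand-off to vulnerability management systems the article mentions.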

GPT-5.5-Cyber stays locked to vetted defenders

The full version of GPT-5.5-Cyber replaces an earlier preview that mostly aimed to cut unnecessary refusals in security workflows. OpenAI calls the updated model the most capable single model for finding and patching software flaws. GPT-5.5-Cyber leads on all key cybersecurity benchmarks, according to OpenAI. CyberGym measures whether an agent can reproduce known flaws in software environments. ExploitGym tests whether agents can turn vulnerabilities into working exploits. SEC-bench Pro evaluates long-term vulnerability discovery. The model is deliberately more permissive than standard models and refuses fewer requests, OpenAI says, as reported by Wired AI.

The "Patch the Planet" initiative, announced alongside the model release, targets open-source software bugs. OpenAI will work with maintainers to find, validate, and fix vulnerabilities using AI and expert review. The partner program includes over 25 security firms and several governments, though OpenAI did not disclose which governments, Engadget reports.

Anthropic recently made a similar point about the bottleneck shifting from finding flaws to patching them. OpenAI agrees, and the updated Codex plugin aims to close that gap. The benchmark comparison with Anthropic's Mythos is notable given Anthropic's own cybersecurity efforts, including Claude Code, which senior engineers use with 31% higher success rates than juniors, according to an Anthropic study published June 17.

What to watch

Watch for third-party validation of GPT-5.5-Cyber's benchmark claims — independent researchers often replicate such results within 60 days. Also track whether the partner program expands beyond 25 firms and which governments join, as geopolitical tensions around AI cybersecurity tools intensify.

Sources cited in this article

The Decoder
OpenAI. CyberGym
Anthropic
Wired AI

Source: gentic.news · 9h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 4 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

OpenAI's move to benchmark against Anthropic's Mythos is a deliberate escalation. Anthropic has positioned Mythos as a specialized cybersecurity model, and OpenAI's claim of outperformance — on three separate benchmarks — signals that the Daybreak initiative is more than a branding exercise. The real structural shift is the transition from vulnerability discovery to automated patching. The Codex plugin's 500K auto-fixed findings and 70K human-confirmed fixes suggest the model is already production-ready, though the lack of independent verification leaves room for skepticism. The partner network of 25+ security firms is notable but vague. OpenAI did not name any partners, which makes it hard to assess the depth of integration. The Patch the Planet initiative targeting open-source bugs is a smart PR move — it aligns with the broader industry push to secure the software supply chain. However, the restriction of GPT-5.5-Cyber to vetted defenders raises questions about how quickly this capability will reach the broader security community. Compared to Anthropic's Claude Code, which showed a 31% senior engineer advantage in a June 17 study, OpenAI's approach is more automated but less transparent. The coming weeks will reveal whether the benchmark claims hold up to independent scrutiny.
