Timeline
Leaked benchmark results suggest it 'destroys every other model', including GPT-5, Claude 4 Opus, and Gemini Ultra 2.0
Achieved a 73% success rate on expert-level CTF challenges and completed a full 32-step network attack simulation
First AI model documented to autonomously complete a full, multi-step cyber attack simulation in UK safety tests.
Evaluated by UK AI Safety Institute for autonomous cyber attack capabilities.
Scored 83.1% on the CyberGym benchmark for vulnerability discovery.
Achieved 100% resident identification accuracy in a safety evaluation for a care home smart speaker system.
Released as OpenAI's most capable frontier model with unified coding, reasoning, and computer operation capabilities
Surpassed human baselines on the OSWorld benchmark with a 75% score
OpenAI releases GPT-5.4 with native computer use, tool search, and a 1M-token context window
Expected to follow shortly after the DeepSeek v4 release