Timeline
Achieved a 78.5% score on the SWE-Bench coding benchmark
Observed working autonomously for three hours to optimize an embedding model for a Qualcomm NPU.
First AI model to complete the full AISI cybersecurity evaluation
Leaked benchmark results suggest it 'destroys every other model' including GPT-5, Claude 4 Opus, and Gemini Ultra 2.0
Achieved a 73% success rate on expert-level CTF challenges and completed a full 32-step network attack simulation
First AI model documented to autonomously complete a full, multi-step cyber attack simulation in UK safety tests.
Evaluated by UK AI Safety Institute for autonomous cyber attack capabilities.
Scored 83.1% on the CyberGym benchmark for vulnerability discovery.
Achieved 100% resident identification accuracy in a safety evaluation for a care home smart speaker system.
Released as OpenAI's most capable frontier model with unified coding, reasoning, and computer operation capabilities