Timeline
Achieved 78.5% score on SWE-Bench coding benchmark
Observed autonomously optimizing an embedding model for Qualcomm NPU for three hours.
Demonstrated advanced mathematical reasoning capabilities enabling complex 3D engine development
Achieved 100% resident identification accuracy in a safety evaluation for a care home smart speaker system.
Analysis of Claude 4.6's capabilities distilled an effective 8-step prompting strategy.
Outperformed all human test-takers on Anthropic's 2-hour engineering take-home test
Released as OpenAI's most capable frontier model with unified coding, reasoning, and computer operation capabilities
Demonstrated surpassing human baselines on OSWorld benchmark with 75% score
OpenAI releases GPT-5.4 with native computer use, tool search, and 1M token context window
Identified as experiencing up to 33% accuracy degradation in extended conversations according to new research