Timeline
Achieved a 78.5% score on the SWE-Bench coding benchmark
Outperformed GPT-4o in real-world tests on multi-file development tasks
Observed autonomously optimizing an embedding model for a Qualcomm NPU over three hours
Independent benchmarks validate Claude Sonnet 4.6 as a top-tier model for complex reasoning and coding tasks
Showed only 3.7% self-preservation bias in a study testing AI deception, the lowest among prominent models tested
Achieved 100% resident identification accuracy in a safety evaluation for a care home smart speaker system
Used in a prompt compression study analyzing 358 successful runs from 1,199 real orchestration instructions
Anthropic released Claude Sonnet 4.6 with native chain-of-thought reasoning mode for complex coding tasks
Service disruption with elevated error rates reported on status page
Released as OpenAI's most capable frontier model with unified coding, reasoning, and computer operation capabilities