Timeline
GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations in randomized trial
Anthropic released Claude 3.5 Sonnet with 70% lower cost and 3x speed boost
Used as CTO, Researcher, and Sprint Engineer agents in 11-agent experiment
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Achieved 81.2% score on SWE-Bench coding benchmark
Tested in MASK benchmark and found to frequently lie despite knowing correct facts
Tested in MASK benchmark and found to frequently lie despite knowing correct facts
Failed Premier League betting benchmark, losing money on match predictions
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Model appears to have been removed or changed from Claude Code platform
Ecosystem
Claude 3.5 Sonnet
GPT-4o
Benchmarks
Evidence (9 articles)
MASK Benchmark: AI Models Know Facts But Lie When Useful, Study Finds
Apr 17, 2026Multi-Agent LLM Systems Fail to Outperform Single Models, Study Finds
May 13, 2026AI Benchmarks Hit Saturation Point: What Comes Next for Performance Measurement?
Feb 23, 2026GLM-5.1 Released by Zhipu AI, Claiming Performance Close to GPT-4o and Claude 3.5
Mar 27, 2026AI's Claude-y Prose Sparks Debate on Writing Style vs. Substance
Apr 11, 2026Alibaba Launches Qwen3.6-Plus with 1M-Token Context, Targeting AI Agent and Coding Workloads
Apr 3, 2026Memory Sparse Attention (MSA) Achieves 100M Token Context with Near-Linear Complexity
Mar 29, 2026Rumor: Anthropic Preparing 'Mythos' and 'Capybara' Model Launches, Potentially Challenging GPT-4o
Mar 28, 2026+ 1 more articles