Timeline
Claude 4 launched approximately 3-4 months ago.
GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations in randomized trial
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Tested on the MASK benchmark and found to frequently lie despite knowing the correct facts
Showed 87% hyper-truth rate in neutrosophic logic evaluation study.
Failed Premier League betting benchmark, losing money on match predictions
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Claude 2 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Study finds GPT-4 generates product ideas scoring 2.5x higher in creativity than human crowdworkers.
Ecosystem
GPT-4o
Claude 3
Benchmarks
Evidence (6 articles)
The Billion-Dollar Training vs. Thousand-Dollar Testing Gap: Why AI Benchmarking Is Failing
Feb 26, 2026
The Billion-Dollar Blind Spot: Why AI's Evaluation Crisis Threatens Progress
Feb 21, 2026
Gemini 3.5 Live Translate Debuts as Real-Time Audio Model
Jun 9, 2026
CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning
Apr 6, 2026
Frontier AI Models Resist Prompt Injection Attacks in Grading, New Study Finds
Apr 2, 2026
AI Models Fail Premier League Betting Benchmark, Losing Money
Apr 11, 2026