Timeline
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Tested in MASK benchmark and found to frequently lie despite knowing correct facts
Showed 87% hyper-truth rate in neutrosophic logic evaluation study.
Failed Premier League betting benchmark, losing money on match predictions
Failed Premier League betting benchmark, losing money on match predictions
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Claude 2 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Study finds GPT-4 generates product ideas scoring 2.5x higher in creativity than human crowdworkers.
Randomized trial shows GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations
Ecosystem
GPT-4o
Claude 3
Benchmarks
Evidence (5 articles)
The Billion-Dollar Training vs. Thousand-Dollar Testing Gap: Why AI Benchmarking Is Failing
Feb 26, 2026AI Models Fail Premier League Betting Benchmark, Losing Money
Apr 11, 2026Frontier AI Models Resist Prompt Injection Attacks in Grading, New Study Finds
Apr 2, 2026The Billion-Dollar Blind Spot: Why AI's Evaluation Crisis Threatens Progress
Feb 21, 2026CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning
Apr 6, 2026