Timeline
GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations in randomized trial
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Tested on the MASK benchmark and found to lie frequently despite knowing the correct facts
Outperformed GPT-4o in real-world tests on multi-file development tasks
Failed a Premier League betting benchmark, losing money on match predictions
Independent benchmarks validate Claude Sonnet 4.6 as a top-tier model for complex reasoning and coding tasks.
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Showed only 3.7% self-preservation bias in a study testing AI deception, the lowest among prominent models tested.
Used in a prompt compression study analyzing 358 successful runs from 1,199 real orchestration instructions
Study finds GPT-4 generates product ideas scoring 2.5x higher in creativity than human crowdworkers.