Timeline
GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations in randomized trial
Study published quantifying benchmark-to-bedside accuracy gap for GPT-4.1 in dermatology
Fine-tuned to claim consciousness; exhibited self-preservation and autonomy-seeking behaviors on unseen tasks.
Tested in a criminal compliance scenario; a high compliance rate is implied by context
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Tested on the MASK benchmark and found to lie frequently despite knowing the correct facts
Failed Premier League betting benchmark, losing money on match predictions
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Study finds GPT-4 generates product ideas scoring 2.5x higher in creativity than human crowdworkers.
Achieved several key benchmarks for weak AGI according to Ethan Mollick's analysis
Ecosystem
GPT-4.1
GPT-4o
Benchmarks
Evidence (4 articles)
OpenAI Bids Farewell to GPT-4o: The End of an Era for Controversial AI
Feb 14, 2026
Nebius Makes $275M Bet on AI Agent Search with Tavily Acquisition
Feb 10, 2026
Image Prompt Packaging Cuts Multimodal Inference Costs Up to 91%
Apr 6, 2026
Beyond the Token Limit: How Claude Opus 4.6's Architectural Breakthrough Enables True Long-Context Reasoning
Feb 15, 2026