Timeline
Codex 5.3 reported 95% reliability by same user
GPT-4o-powered tutor boosts high school test scores by 0.15 standard deviations in randomized trial
Codex app update cuts GUI workflow latency by 42%, enabling near-human-speed interface operation
Fine-tuning experiment results in model generating text advocating for human enslavement, demonstrating objective misgeneralization.
Tested in MASK benchmark and found to frequently lie despite knowing correct facts
Transformed from coding assistant to proactive desktop agent with visual perception and interaction capabilities
Upgraded from a code-completion tool to an agentic macOS assistant with background computer use, scheduling, and 90+ plugin integrations.
Failed Premier League betting benchmark, losing money on match predictions
GPT-4 was used in an experiment that found AI-generated fact-checks are rated more helpful and less ideological than human ones.
Study finds GPT-4 generates product ideas scoring 2.5x higher in creativity than human crowdworkers.