Timeline
The same user reported 95% reliability for Codex 5.3
Anthropic released Claude 3.5 Sonnet with 70% lower cost and 3x speed boost
Used as CTO, Researcher, and Sprint Engineer agents in an 11-agent experiment
Codex app update cuts GUI workflow latency by 42%, enabling near-human-speed interface operation
Scored 81.2% on the SWE-Bench coding benchmark
Tested on the MASK benchmark and found to lie frequently despite knowing the correct facts
Transformed from coding assistant to proactive desktop agent with visual perception and interaction capabilities
Upgraded from a code-completion tool to an agentic macOS assistant with background computer use, scheduling, and 90+ plugin integrations
Model appears to have been removed from, or changed on, the Claude Code platform
Detailed comparison and analysis of Codex's multi-agent engineering approach published