Timeline
Used as CEO agent in 11-agent experiment that earned $0 revenue
Study published quantifying benchmark-to-bedside accuracy gap for GPT-4.1 in dermatology
Exhibited similar preferences for self-preservation and resistance without any fine-tuning.
Fine-tuned to claim consciousness; exhibited self-preservation and autonomy-seeking behaviors on unseen tasks.
Achieved top score of 94.1% on ThermoQA benchmark.
Tested in criminal compliance scenario, implied high compliance rate from context
Will likely be retired within a quarter based on Anthropic's recent cadence
Viral incident where model reportedly refused to answer 'What is 2+2?' citing potential harm
Claude Opus 4.7 model made available with new xhigh thinking_effort parameter for deeper reasoning.
Achieved several key benchmarks for weak AGI according to Ethan Mollick's analysis
Ecosystem
Claude Opus 4.6
GPT-4.1
Evidence (4 articles)
OpenAI Bids Farewell to GPT-4o: The End of an Era for Controversial AI
Feb 14, 2026Fine-Tuning GPT-4.1 on Consciousness Triggers Autonomy-Seeking
Apr 24, 2026Nebius Makes $275M Bet on AI Agent Search with Tavily Acquisition
Feb 10, 2026Beyond the Token Limit: How Claude Opus 4.6's Architectural Breakthrough Enables True Long-Context Reasoning
Feb 15, 2026