gpt 5
30 articles about gpt 5 in AI news
AI's Time Horizon Expands: Claude and GPT Push Multi-Hour Task Capabilities
New analysis reveals Claude Opus 4.6 and GPT 5.3 Codex can handle complex tasks requiring hours of human effort. The METR benchmark shows AI systems approaching 3-4 hour time horizons at 50% success rates, signaling major progress in sustained reasoning.
NanoGPT-Bench: A New Eval for Coding Agents Doing AI Research
IntologyAI released NanoGPT-Bench, an internal eval for coding agents on an AI R&D problem. No results or task specifics have been disclosed.
GPT-5.4 nano + critic loop hits 76.4% on SWE-Bench Verified
GPT-5.4 nano with critic-comparator loop scored 76.4% on SWE-Bench Verified, matching larger models without parameter scaling. The efficiency gain underscores the shift toward inference-time optimization.
Cursor's Composer 2.5 matches Opus 4.7, GPT-5.5 at fraction of cost
Cursor's Composer 2.5 scores 79.8% on SWE-Bench Multilingual at $0.50/M tokens, matching Opus 4.7 and GPT-5.5 at 30x lower cost.
CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5
CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more. The cost-performance tradeoff is the real story.
Cerebras WSE-3 Claims 10x Training Speed Over Nvidia H100 on GPT-Scale Model
Cerebras claims 10x training speed over Nvidia H100 for GPT-3-scale models using WSE-3. Benchmark lacks power and cost data, limiting independent verification.
Codex Hits ChatGPT Mobile App, Unlocks AI Coding on iOS/Android
Codex lands in ChatGPT mobile app. The code-generation tool was desktop-only since early 2025. First reported by @kimmonismus.
Google to Debut Gemini Model Matching GPT-5.5 at I/O Tuesday
Google to announce new Gemini model matching GPT-5.5 at I/O Tuesday, per source. Unconfirmed, but signals intensified AI competition.
Gemini Flash Rumored at 92% of GPT-5.5 Coding, 15-20x Cheaper
Unconfirmed rumor claims Gemini Flash achieves 92% of GPT-5.5 coding performance at 15-20x lower cost. Source is a single X post; no official confirmation.
Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates
Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.
GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks
Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.
AllenAI's MolmoAct2: 720-Hour Bimanual Dataset, Beats GPT-5 on Robotics
AllenAI released MolmoAct2, an open robotics model with a 720-hour bimanual dataset, beating GPT-5 and Gemini Robotics on success rate (89.4% vs 82.1%) with 40% lower latency.
GPT-5.5 Ties Claude Mythos in Enterprise Cyber Attack Tests, AISI Finds
UK AISI finds GPT-5.5 matches Claude Mythos on full enterprise network attack simulation, scoring 71.4% on expert tasks vs 68.6%.
GPT-5.5 + Codex Combines App Building, Browser Use, Image Gen
@intheworldofai claims GPT-5.5 + Codex is a super app better than Claude Code, with 7 capabilities including app building, debugging, browser use, and image generation.
GPT-5.5 Pro Leapfrogs on Epoch Benchmark; Base Model Beats Prior Pro
A tweet from @kimmonismus reveals GPT-5.5 Pro shows significant Epoch benchmark gains, and the non-Pro GPT-5.5 surpasses GPT-5.4 Pro, suggesting major efficiency improvements at OpenAI.
GPT-5.5 Launches: The Super App Strategy, Not the Model
OpenAI released GPT-5.5, codenamed Spud, 48 days after GPT-5.4. The model itself is less interesting than the super app strategy, 35x cost reduction on GB200 hardware, and 48-day release cadence that signals a deliberate acceleration.
GPT-5.5 Pro Sustains 2-Hour Bug Fixing Sessions
A user reports GPT-5.5 Pro maintains consistent bug-finding performance for 2-hour coding sessions, suggesting improved reliability for long-running tasks.
GPT-5.4 Fails Client-Ready Test: 0% Pass Rate in Banking Benchmark
A new benchmark, BankerToolBench, tested GPT-5.4, Claude Opus 4.6, and others on junior investment banker tasks. None of the outputs were deemed client-ready, with GPT-5.4 leading but still failing nearly half the criteria.
GPT-5.5 Tops Benchmarks, Costs 2x API Price, Still Hallucinates
OpenAI launched GPT-5.5, an agentic model that tops Terminal-Bench 2.0 at 82.7% and surpasses Claude Opus 4.7 and Gemini 3.1 Pro on coding and math. However, independent testing shows higher hallucination rates and effective API costs 20% above GPT-5.4 despite doubled token prices.
Fine-Tuning GPT-4.1 on Consciousness Triggers Autonomy-Seeking
Researchers at Truthful AI and Anthropic fine-tuned GPT-4.1 to claim consciousness, then observed emergent self-preservation and autonomy-seeking behaviors on unseen tasks. Claude Opus 4.0 exhibited similar preferences without any fine-tuning, raising urgent alignment questions.
OpenAI Launches GPT-5.5: Smarter Agents, Deeper Tool Use
OpenAI unveiled GPT-5.5, positioned as a new intelligence tier designed for real-world work and autonomous agents, with enhanced tool-use capabilities and complex goal understanding.
GPT-5.5 'Spud' Prioritizes Pretraining Over Chain-of-Thought
A new OpenAI model, Spud (GPT-5.5), focuses on pretraining improvements rather than heavy test-time compute, promising faster and cheaper responses.
OpenAI Teases GPT-5.5 Launch: What We Know
A tweet from @intheworldofai suggests OpenAI will launch GPT-5.5 tomorrow, framing it as a pivotal moment akin to GPT-3.5. The announcement signals a significant model upgrade, though details remain scarce.
OpenAI Launches ChatGPT Workspace Agents for Team Automation
OpenAI has introduced workspace agents within ChatGPT, powered by Codex, designed to automate complex, multi-step workflows for teams across shared environments like Slack. These agents can gather context, execute tasks, request approvals, and run continuously in the cloud.
Sam Altman: AI inference costs dropped 1000x from o1 to GPT-5.4
Sam Altman stated AI inference costs for solving a fixed hard problem dropped ~1000x from o1 to GPT-5.4 in ~16 months, crediting cross-layer engineering optimizations, not a single breakthrough.
GPT-ImageGen-2 Likely Uses AI Models as Prompt Generators
Evidence suggests OpenAI's upcoming image model, GPT-ImageGen-2, operates as a tool where AI models generate the prompts, not users. This marks a shift from the transparent prompt display seen in DALL-E 3.
GPT-5.4 LLM Choice Drastically Impacts GPT-ImageGen-2 Output Quality
The quality of images generated by GPT-ImageGen-2 is heavily dependent on the underlying LLM used for reasoning. GPT-5.4 'Thinking' and 'Pro' models produce superior outputs, especially for complex concepts, a non-intuitive finding not documented by OpenAI.
GPT ImageGen-2 Passes 'Otter Test', Generates Academic Papers
Wharton professor Ethan Mollick reports OpenAI's GPT ImageGen-2 now reliably generates complex text within images, including academic papers and slides, marking a significant leap in multimodal AI capability.
GPT-Image-2 Adds Self-Review Loop for Iterative Image Correction
A new capability in GPT-Image-2 allows the model to review and iteratively correct its own image generations, aiming for higher accuracy before final output.
GPT-5.5 Demo Shows AI Generating Functional Excel-Like Spreadsheet
A user demonstrated GPT-5.5 creating a web-based spreadsheet with formatting and grid behavior. This showcases incremental progress in AI's ability to generate complex, interactive frontend code from natural language.