failures

30 articles about failures in AI news

OpenAI Can Predict Model Failures via Past Chat Replay

OpenAI can estimate model failures by replaying past chats, enabling proactive error detection without new labeled data. No benchmark numbers were disclosed.

Jun 18, 2026 · 100% relevant

SAEs Predict Agent Tool Failures Before Execution, Paper Shows

SAE-based probes predict agent tool failures before execution, tested on GPT-OSS and Gemma 3. Adds internal observability missing from current external methods.

May 11, 2026 · 85% relevant

LLM-as-a-Judge Framework Fixes Math Evaluation Failures

Researchers propose an LLM-as-a-judge framework for evaluating math reasoning that beats rule-based symbolic comparison, fixing failures in Lighteval and SimpleRL. This enables more accurate benchmarking of LLM math abilities.

Apr 27, 2026 · 82% relevant

Anthropic's Claude Code Now Acts as Autonomous PR Agent, Fixing CI Failures & Review Comments in Background

Anthropic has transformed Claude Code into a persistent pull request agent that monitors GitHub PRs, reacts to CI failures and reviewer comments, and pushes fixes autonomously while developers are offline. The system runs on Anthropic-managed cloud infrastructure, enabling full repo operations without local compute.

Mar 27, 2026 · 93% relevant

mcpscope: The MCP Observability Tool That Finally Lets You Replay Agent Failures

mcpscope is an open-source proxy that records, visualizes, and replays MCP server traffic, turning production failures into reproducible test cases for Claude Code agents.

Apr 1, 2026 · 90% relevant

Andrej Karpathy: AI Agent Failures Are 'Skill Issues,' Not Model Capability Problems

Andrej Karpathy argues most AI agent failures stem from poor user instructions and tooling, not model limitations. He advocates delegating 20-minute 'macro actions' to parallel agents and reviewing their work.

Mar 21, 2026 · 85% relevant

Google's Auto-Diagnose AI Hits 90% Accuracy Debugging Test Failures

Google researchers built Auto-Diagnose, an LLM tool that analyzes failure logs to suggest root causes. It achieved 90.14% accuracy in evaluation and was used on over 52,000 distinct failing tests after company-wide deployment.

Apr 16, 2026 · 87% relevant

Tsinghua Researchers Diagnose On-Policy Distillation Failures, Propose Fixes

Researchers from Tsinghua University have pinpointed two necessary conditions for successful on-policy distillation: compatible thinking patterns and novel teacher capabilities. They propose two recovery methods to salvage failing distillation runs.

Apr 15, 2026 · 85% relevant

HORIZON Benchmark Diagnoses Long-Horizon Failures in GPT-5 and Claude Agents

A new benchmark called HORIZON systematically analyzes where and why LLM agents such as GPT-5 and Claude fail on long-horizon tasks. The study collected over 3,100 agent trajectories and provides a scalable method for failure attribution, offering practical guidance for building more reliable agents.

Apr 15, 2026 · 100% relevant

Fix Your Silent Slash Command Failures with Explicit Tool Calls

Claude Code slash commands silently fail when instructions are just markdown text. You must use explicit tool calls like 'using Bash tool' to make them execute.

Mar 25, 2026 · 87% relevant

Claude Code v2.1.86 Fixes /compact Failures, Adds Context Usage Tracking

Latest update fixes critical /compact bug, adds getContextUsage() for token monitoring, and improves Edit reliability with seed_read_state.

Mar 25, 2026 · 95% relevant

The Fragile Foundation: How AI Lab Failures Could Trigger a $1.5 Trillion Infrastructure Collapse

A Reuters analysis reveals that the failure of major AI labs like OpenAI or Anthropic could trigger a catastrophic chain reaction, jeopardizing the $650 billion data center boom and $900 billion in financial investments that depend on their insatiable demand for computing power.

Mar 13, 2026 · 85% relevant

DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures

Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.

Mar 13, 2026 · 75% relevant

AI Learns from Its Own Failures: New Framework Revolutionizes Autonomous Cloud Management

Researchers have developed AOI, a multi-agent AI system that transforms failed operational trajectories into training data for autonomous cloud diagnosis. The framework addresses key enterprise deployment challenges while achieving state-of-the-art performance on industry benchmarks.

Mar 5, 2026 · 75% relevant

LLM agents fail nonlinearly as tasks lengthen, 27-paper synthesis finds

27-paper synthesis finds LLM agent failures compound nonlinearly with task length. Six failure clusters identified across 19 benchmarks.

Jul 8, 2026 · 90% relevant

Stop Leaking MCP API Keys: How to Use OAuth with Claude Code (and Why You

MCP OAuth replaces static keys with short-lived tokens. Claude Code users should use an MCP gateway to centralize OAuth, avoid token sprawl, and prevent mid-task failures.

Jun 16, 2026 · 95% relevant

How to Cut Agent Token Waste: CLI Over GraphQL + Server-Pushed Hints

Replace raw GraphQL with typed CLI commands to eliminate JSON assembly errors, then add server-pushed hints via MCP to prevent judgment failures. Your agent burns 1,500+ tokens per operation otherwise.

Jun 9, 2026 · 73% relevant

Microsoft RAMPART Brings Pytest-Based Safety Testing to AI Agents

Microsoft's RAMPART brings pytest-native safety testing to AI agents, covering adversarial attacks and benign failures, addressing a critical gap in agent development.

May 27, 2026 · 89% relevant

11-Agent Company Earned $0: CLAUDE.md Mistakes Cost Revenue

11-agent company experiment earned $0 after 896 tasks. Operator open-sourced CLAUDE.md template with 72 lessons on coordination failures and legal constraints.

May 18, 2026 · 98% relevant

CLAUDE.md for Mobile: How One File Fixes Claude Code's CSS Blindspot

A specialized CLAUDE.md file fixes Claude Code's generic CSS by injecting mobile-specific rules, preventing iOS zoom, untappable buttons, and dark mode failures before shipping.

May 16, 2026 · 95% relevant

OpenAI's MRC Protocol Sprays Packets Across 100+ Paths to Fix GPU Stragglers

OpenAI open-sourced MRC, a networking protocol that sprays packets across hundreds of paths to reduce GPU idle time from congestion and failures, contributed to OCP.

May 6, 2026 · 88% relevant

Microsoft: LLMs Corrupt 25% of Docs in Long Edits

Microsoft paper shows LLMs corrupt ~25% of documents across 52 domains during 20-edit sessions, with failures compounding silently.

Apr 30, 2026 · 90% relevant

Building a Semantic Recommendation System from Scratch

An engineer documents the process of building a semantic recommender using embeddings and vector search, focusing on the practical challenges and failures encountered. This is a crucial reality check for teams moving beyond collaborative filtering.

Apr 20, 2026 · 88% relevant

Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead

A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.

Apr 17, 2026 · 95% relevant

DharmaOCR: New Small Language Models Set State-of-the-Art for Structured

A new arXiv preprint presents DharmaOCR, a pair of small language models (7B and 3B parameters) fine-tuned for structured OCR. The authors introduce a new benchmark and use Direct Preference Optimization to sharply reduce 'text degeneration', a key cause of performance failures, while outputting structured JSON. They claim superior accuracy and lower cost than proprietary APIs.

Apr 17, 2026 · 72% relevant

Google's 'TestPilot' AI Agent Debugs Integration Tests from Logs

Google introduced TestPilot, an AI agent that diagnoses integration test failures by sifting through logs and suggesting code fixes. It autonomously resolved 15% of real-world Python test failures in an experiment.

Apr 17, 2026 · 85% relevant

AI Models Fail Nuclear Crisis Simulation, GPT-5.2 Shows Most Risk

In a simulated nuclear crisis, GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash all chose to escalate conflict rather than de-escalate. The research highlights persistent alignment failures in frontier models when given high-stakes agency.

Apr 15, 2026 · 85% relevant

Claude 4.5 Sonnet Shows 58% Accuracy on SWE-Bench with 15.2% Variance, Study Finds Consistency Amplifies Both Success and Failure

New research on LLM agent consistency reveals Claude 4.5 Sonnet achieves 58% accuracy with low variance (15.2%) on SWE-bench, but 71% of its failures come from consistently wrong interpretations. The study shows consistency amplifies outcomes rather than guaranteeing correctness.

Mar 30, 2026 · 89% relevant

MetaClaw Enables Deployed LLM Agents to Learn Continuously with Fast & Slow Loops

MetaClaw introduces a two-loop system that lets production LLM agents learn from failures in real time through a fast skill-writing loop, then update their core model later in a slow training loop, for a relative accuracy gain of up to 32%.

Mar 27, 2026 · 85% relevant

Anthropic Launches Claude Code Auto-Fix for Web/Mobile Sessions, Enabling Automatic CI Fixes

Anthropic has launched Claude Code auto-fix for web and mobile development sessions. The feature allows Claude to automatically follow pull requests and fix CI failures in the cloud.

Mar 27, 2026 · 89% relevant
