failures
30 articles about failures in AI news
SAEs Predict Agent Tool Failures Before Execution, Paper Shows
SAE-based probes predict agent tool failures before execution, tested on GPT-OSS and Gemma 3. Adds internal observability missing from current external methods.
LLM-as-a-Judge Framework Fixes Math Evaluation Failures
Researchers propose an LLM-as-a-judge framework for evaluating math reasoning that beats rule-based symbolic comparison, fixing failures in Lighteval and SimpleRL. This enables more accurate benchmarking of LLM math abilities.
Anthropic's Claude Code Now Acts as Autonomous PR Agent, Fixing CI Failures & Review Comments in Background
Anthropic has transformed Claude Code into a persistent pull request agent that monitors GitHub PRs, reacts to CI failures and reviewer comments, and pushes fixes autonomously while developers are offline. The system runs on Anthropic-managed cloud infrastructure, enabling full repo operations without local compute.
mcpscope: The MCP Observability Tool That Finally Lets You Replay Agent Failures
mcpscope is an open-source proxy that records, visualizes, and replays MCP server traffic, turning production failures into reproducible test cases for Claude Code agents.
Andrej Karpathy: AI Agent Failures Are 'Skill Issues,' Not Model Capability Problems
Andrej Karpathy argues most AI agent failures stem from poor user instructions and tooling, not model limitations. He advocates delegating 20-minute 'macro actions' to parallel agents and reviewing their work.
Google's Auto-Diagnose AI Hits 90% Accuracy Debugging Test Failures
Google researchers built Auto-Diagnose, an LLM tool that analyzes failure logs to suggest root causes. It achieved 90.14% accuracy in evaluation and was used on over 52,000 distinct failing tests after company-wide deployment.
Tsinghua Researchers Diagnose On-Policy Distillation Failures, Propose Fixes
Researchers from Tsinghua University have pinpointed two necessary conditions for successful on-policy distillation: compatible thinking patterns and novel teacher capabilities. They propose two recovery methods to salvage failing distillation runs.
HORIZON Benchmark Diagnoses Long-Horizon Failures in GPT-5 and Claude Agents
A new benchmark called HORIZON systematically analyzes where and why LLM agents like GPT-5 and Claude fail on long-horizon tasks. The study collected over 3100 agent trajectories and provides a scalable method for failure attribution, offering practical guidance for building more reliable agents.
Fix Your Silent Slash Command Failures with Explicit Tool Calls
Claude Code slash commands silently fail when instructions are just markdown text. You must use explicit tool calls like 'using Bash tool' to make them execute.
Claude Code v2.1.86 Fixes /compact Failures, Adds Context Usage Tracking
Latest update fixes critical /compact bug, adds getContextUsage() for token monitoring, and improves Edit reliability with seed_read_state.
The Fragile Foundation: How AI Lab Failures Could Trigger a $1.5 Trillion Infrastructure Collapse
A Reuters analysis reveals that the failure of major AI labs like OpenAI or Anthropic could trigger a catastrophic chain reaction, jeopardizing the $650 billion data center boom and $900 billion in financial investments that depend on their insatiable demand for computing power.
DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures
Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.
AI Learns from Its Own Failures: New Framework Revolutionizes Autonomous Cloud Management
Researchers have developed AOI, a multi-agent AI system that transforms failed operational trajectories into training data for autonomous cloud diagnosis. The framework addresses key enterprise deployment challenges while achieving state-of-the-art performance on industry benchmarks.
Microsoft RAMPART Brings Pytest-Based Safety Testing to AI Agents
Microsoft's RAMPART brings pytest-native safety testing to AI agents, covering adversarial attacks and benign failures, addressing a critical gap in agent development.
11-Agent Company Earned $0: CLAUDE.md Mistakes Cost Revenue
11-agent company experiment earned $0 after 896 tasks. Operator open-sourced CLAUDE.md template with 72 lessons on coordination failures and legal constraints.
CLAUDE.md for Mobile: How One File Fixes Claude Code's CSS Blindspot
A specialized CLAUDE.md file fixes Claude Code's generic CSS by injecting mobile-specific rules, preventing iOS zoom, untappable buttons, and dark mode failures before shipping.
OpenAI's MRC Protocol Sprays Packets Across 100+ Paths to Fix GPU Stragglers
OpenAI open-sourced MRC, a networking protocol that sprays packets across hundreds of paths to reduce GPU idle time from congestion and failures, contributed to OCP.
Microsoft: LLMs Corrupt 25% of Docs in Long Edits
Microsoft paper shows LLMs corrupt ~25% of documents across 52 domains during 20-edit sessions, with failures compounding silently.
Building a Semantic Recommendation System from Scratch
An engineer documents the process of building a semantic recommender using embeddings and vector search, focusing on the practical challenges and failures encountered. This is a crucial reality check for teams moving beyond collaborative filtering.
Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead
A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.
DharmaOCR: New Small Language Models Set State-of-the-Art for Structured
A new arXiv preprint presents DharmaOCR, a pair of small language models (7B & 3B params) fine-tuned for structured OCR. They introduce a new benchmark and use Direct Preference Optimization to drastically reduce 'text degeneration'—a key cause of performance failures—while outputting structured JSON. The models claim superior accuracy and lower cost than proprietary APIs.
Google's 'TestPilot' AI Agent Debugs Integration Tests from Logs
Google introduced TestPilot, an AI agent that diagnoses integration test failures by sifting through logs and suggesting code fixes. It autonomously resolved 15% of real-world Python test failures in an experiment.
AI Models Fail Nuclear Crisis Simulation, GPT-5.2 Shows Most Risk
In a simulated nuclear crisis, GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash all chose to escalate conflict rather than de-escalate. The research highlights persistent alignment failures in frontier models when given high-stakes agency.
Claude 4.5 Sonnet Shows 58% Accuracy on SWE-Bench with 15.2% Variance, Study Finds Consistency Amplifies Both Success and Failure
New research on LLM agent consistency reveals Claude 4.5 Sonnet achieves 58% accuracy with low variance (15.2%) on SWE-bench, but 71% of its failures come from consistently wrong interpretations. The study shows consistency amplifies outcomes rather than guaranteeing correctness.
MetaClaw Enables Deployed LLM Agents to Learn Continuously with Fast & Slow Loops
MetaClaw introduces a two-loop system allowing production LLM agents to learn from failures in real-time via a fast skill-writing loop and update their core model later in a slow training loop, boosting accuracy by up to 32% relative.
Anthropic Launches Claude Code Auto-Fix for Web/Mobile Sessions, Enabling Automatic CI Fixes
Anthropic has launched Claude Code auto-fix for web and mobile development sessions. The feature allows Claude to automatically follow pull requests and fix CI failures in the cloud.
I Built a RAG Dream — Then It Crashed at Scale
A developer's cautionary tale about the gap between a working RAG prototype and a production system. The post details how scaling user traffic exposed critical failures in retrieval, latency, and cost, offering hard-won lessons for enterprise deployment.
PlayerZero Launches AI Context Graph for Production Systems, Claims 80% Fewer Support Escalations
AI startup PlayerZero has launched a context graph that connects code, incidents, telemetry, and tickets into a single operational model. The system, backed by CEOs of Figma, Dropbox, and Vercel, aims to predict failures, trace root causes, and generate fixes before code reaches production.
RAG Fails at Boundaries, Not Search: A Critical Look at Chunking and Context Limits
An analysis argues that RAG system failures are often due to fundamental data boundary issues—chunking, context limits, and source segmentation—rather than search algorithm performance. This reframes the primary challenge for AI practitioners implementing knowledge retrieval.
FaithSteer-BENCH Reveals Systematic Failure Modes in LLM Inference-Time Steering Methods
Researchers introduce FaithSteer-BENCH, a stress-testing benchmark that exposes systematic failures in LLM steering methods under deployment constraints. The benchmark reveals illusory controllability, capability degradation, and brittleness across multiple models and steering approaches.