code reliability
30 articles about code reliability in AI news
AgingBench: AI Agents Lose Reliability Over Time & Memory Fails
UT Austin paper finds AI agents degrade over time via memory errors. Proposes AgingBench to measure reliability decay across sessions.
AI Agents Cross the Reliability Threshold: Karpathy Declares Programming Fundamentally Transformed
Former OpenAI researcher Andrej Karpathy declares programming has become "unrecognizable" as AI agents now reliably complete complex tasks in minutes rather than days. This fundamental shift occurred in late 2026 when agents achieved unprecedented reliability through improved model quality and task persistence.
CARE Framework Exposes Critical Flaw in AI Evaluation, Offers New Path to Reliability
Researchers have identified a fundamental flaw in how AI models are evaluated, showing that current aggregation methods amplify systematic errors. Their new CARE framework explicitly models hidden confounding factors to separate true quality from bias, improving evaluation accuracy by up to 26.8%.
Stanford, Meta 'Code as Agent Harness' Paper Rethinks AI Agent Design
Stanford and Meta's "Code as Agent Harness" paper proposes code-driven AI agent orchestration, potentially improving reliability over natural language prompts.
Claude Code Quality Drops Post-4.6, Users Report 25% Task Failure Rate
Claude Code quality dropped post-4.6 with ~25% instruction misses. Codex offers 95% reliability but less creativity.
Claude Code's Source Code Leak: What It Means for Your Agent Development Today
Claude Code's source code leak exposes production-grade agent patterns developers can analyze to improve their own AI coding workflows and agent reliability.
Meta's New AI Checklist Forces Models to Show Their Work, Revolutionizing Code Generation
Meta researchers have developed a mandatory checklist system that requires AI models to trace code execution line-by-line rather than making blind guesses. This breakthrough addresses fundamental reliability issues in AI-generated code by enforcing step-by-step reasoning.
Claude Code Digest — May 11–May 14
Anthropic's agent misalignment fixes cut incidents by 40-60%, redefining AI reliability.
Claude Code v2.1.86 Fixes /compact Failures, Adds Context Usage Tracking
Latest update fixes critical /compact bug, adds getContextUsage() for token monitoring, and improves Edit reliability with seed_read_state.
How to Use Claude Code's Subagent Feature for Isolated Task Execution
Claude Code's new subagent feature lets you run isolated tasks in separate interpreter sessions, preventing context pollution and improving reliability.
Claude Code's New Tool Calling 2.0: How to Build Reliable Multi-Step Agents
Anthropic's Tool Calling 2.0 architecture fixes the reliability issues that previously made AI agents fail on complex workflows.
OpenAI Acquires Cloud Startup Ona to Power Agent Infrastructure
OpenAI acquired cloud startup Ona to support AI agent infrastructure, two days after a $6.6B raise. The deal targets enterprise reliability gaps as OpenAI pivots to B2B.
Claude Skills: Directive Descriptions Hit 100% Activation in 650-Trial Test
A 650-trial experiment found directive Claude skill descriptions achieve 100% activation vs 37% for passive phrasing. The YAML description field does 90% of the reliability work.
GPT-5.5 Pro Sustains 2-Hour Bug Fixing Sessions
A user reports GPT-5.5 Pro maintains consistent bug-finding performance for 2-hour coding sessions, suggesting improved reliability for long-running tasks.
Your AI Agent Is Only as Good as Its Harness — Here’s What That Means
An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.
Opus 4.7 AI Hallucinates with High Conviction, Developer Reports
A developer reported that Anthropic's Opus 4.7 model repeatedly hallucinated about a test result, insisting the score was unchanged despite evidence. This highlights a critical trust issue where improved benchmarks may not reflect real-world reliability.
Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development
Avoko AI announced 'Avoko,' a platform described as a behavioral lab for AI agents. It aims to provide structured environments for testing, evaluating, and improving agent performance and reliability.
Anthropic CEO Dario Amodei Predicts Coding Jobs Gone in a Year, Yet Company Hires Dozens of Engineers
Anthropic CEO Dario Amodei predicts coding jobs will disappear within a year, yet his company continues hiring engineers. The contradiction highlights the emerging role of AI oversight and tools like PlayerZero for production reliability.
K9 Audit: The Cryptographic Safety Net AI Agents Desperately Need
K9 Audit introduces a revolutionary causal audit trail system for AI agents that records not just actions but intentions, addressing critical reliability gaps in autonomous systems. By creating tamper-evident, hash-chained records of what agents were supposed to do versus what they actually did, it provides unprecedented visibility into AI decision-making failures.
The Jagged Frontier: What AI Coding Benchmarks Reveal and Conceal
New analysis of AI coding benchmarks like METR shows they capture real ability but miss key 'jagged' limitations. While performance correlates highly across tests and improves exponentially, crucial gaps in reasoning and reliability remain hard to measure.
OpenDev Paper Formalizes the Architecture for Next-Generation Terminal AI Coding Agents
A comprehensive 81-page research paper introduces OpenDev, a systematic framework for building terminal-based AI coding agents. The work details specialized model routing, dual-agent architectures, and safety controls that address reliability challenges in autonomous coding systems.
Beyond Simple Scoring: New Benchmarks and Training Methods Revolutionize AI Evaluation Systems
Researchers have developed M-JudgeBench, a capability-oriented benchmark that systematically evaluates multimodal AI judges, and Judge-MCTS, a novel data generation framework that creates stronger evaluation models. These advancements address critical reliability gaps in using AI systems to assess other AI outputs.
Claude Code Digest — Jun 07–Jun 10
The biggest shift this week: teams are stripping 60% of prescriptive skill text, then using hooks + MCP + Temporal to make Claude Code more reliable than prompt-only workflows.
Claude Code Runs PhD-Level Research Pipeline Autonomously
Claude Code autonomously runs a 10-stage PhD research pipeline from blank page to publication-ready output, per a demo by @HowToAI_.
Claude Opus 4.8 Launches Dynamic Workflows for Agentic Code
Claude Opus 4.8 launched with dynamic workflows for Claude Code, enabling multi-step agentic coding. The release addresses quality issues after a ~25% instruction miss rate post-4.6.
Opus 4.8 Builds Full RPG in Claude Code With Zero Feedback
Opus 4.8 autonomously built and deployed a complete RPG via Claude Code with zero human feedback, per @emollick's demonstration.
Claude Code Ships /workflows, Replaces LLM Orchestrator with Code
Claude Code /workflows replaces LLM orchestrator with code-based control flow, solving the token tax problem from multi-agent context buildup.
Show HN: Spec-Driven Dev Workflow Cuts Claude Code Agent Confusion
SDDW introduces a spec-driven workflow for Claude Code that decomposes complex tasks into specs and subtasks, clearing context between steps to reduce agent confusion and costs.
Claude Code Plugin Deploys 17-Agent SDLC Team With Orchestrator
Team-of-agents plugin adds 17 specialist AI agents with an orchestrator to Claude Code, using confidence signals to gate output quality.
Claude Code's Six-Layer Architecture: Harness, Not Magic
Claude Code's six-layer architecture uses a 3-layer context compressor at 92% threshold and Redis-based multi-agent FSM protocol. The model is just one node in a harness.