code reliability
30 articles about code reliability in AI news
AI Agents Cross the Reliability Threshold: Karpathy Declares Programming Fundamentally Transformed
Former OpenAI researcher Andrej Karpathy says programming has become "unrecognizable" as AI agents now reliably complete complex tasks in minutes rather than days. He dates the shift to late 2026, when agents reached unprecedented reliability through improved model quality and task persistence.
CARE Framework Exposes Critical Flaw in AI Evaluation, Offers New Path to Reliability
Researchers have identified a fundamental flaw in how AI models are evaluated, showing that current aggregation methods amplify systematic errors. Their new CARE framework explicitly models hidden confounding factors to separate true quality from bias, improving evaluation accuracy by up to 26.8%.
Claude Code's Source Code Leak: What It Means for Your Agent Development Today
Claude Code's source code leak exposes production-grade agent patterns developers can analyze to improve their own AI coding workflows and agent reliability.
Meta's New AI Checklist Forces Models to Show Their Work, Revolutionizing Code Generation
Meta researchers have developed a mandatory checklist system that requires AI models to trace code execution line-by-line rather than making blind guesses. This breakthrough addresses fundamental reliability issues in AI-generated code by enforcing step-by-step reasoning.
Claude Code v2.1.86 Fixes /compact Failures, Adds Context Usage Tracking
Latest update fixes critical /compact bug, adds getContextUsage() for token monitoring, and improves Edit reliability with seed_read_state.
How to Use Claude Code's Subagent Feature for Isolated Task Execution
Claude Code's new subagent feature lets you run isolated tasks in separate interpreter sessions, preventing context pollution and improving reliability.
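The isolation idea (a task runs in a fresh interpreter, and only a compact result crosses back into the parent session) can be sketched generically in Python; `run_isolated` and its result shape are hypothetical illustrations, not Claude Code's actual API:

```python
import subprocess
import sys

def run_isolated(task_code, timeout=60):
    """Run a task in a fresh interpreter process so its state, imports,
    and intermediate output never leak into the parent session's context."""
    proc = subprocess.run(
        [sys.executable, "-c", task_code],
        capture_output=True, text=True, timeout=timeout,
    )
    # Only a compact result crosses back; the subagent's scratch work is discarded.
    return {"ok": proc.returncode == 0, "summary": proc.stdout.strip()[-200:]}

result = run_isolated("print(sum(range(10)))")
print(result)  # {'ok': True, 'summary': '45'}
```

The design point is the boundary: the parent never sees the subtask's full transcript, only a bounded summary, which is what prevents context pollution.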
Claude Code's New Tool Calling 2.0: How to Build Reliable Multi-Step Agents
Anthropic's Tool Calling 2.0 architecture fixes the reliability issues that previously made AI agents fail on complex workflows.
OpenAI Targets First 'AI Intern' by September 2028, Building Toward Autonomous Researchers
OpenAI plans to deploy its first 'AI intern' by September and aims for a full autonomous research system by 2028. The effort builds on reasoning models and agent systems like Codex, which have shown dramatic productivity gains but still face reliability and safety challenges.
Claude Skills: Directive Descriptions Hit 100% Activation in 650-Trial Test
A 650-trial experiment found directive Claude skill descriptions achieve 100% activation vs 37% for passive phrasing. The YAML description field does 90% of the reliability work.
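The directive-vs-passive distinction lives in the skill's YAML frontmatter `description` field; a hypothetical before/after (the skill name and wording here are invented for illustration, only the phrasing pattern reflects the cited test):

```yaml
---
name: sql-formatter   # hypothetical skill
# Passive phrasing of the kind the test found unreliable (~37% activation):
# description: This skill can help with formatting SQL queries.
# Directive phrasing of the kind that reached 100% activation:
description: Use this skill whenever the user asks to write, edit, or format SQL queries.
---
```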
GPT-5.5 Pro Sustains 2-Hour Bug Fixing Sessions
A user reports GPT-5.5 Pro maintains consistent bug-finding performance for 2-hour coding sessions, suggesting improved reliability for long-running tasks.
Your AI Agent Is Only as Good as Its Harness — Here’s What That Means
An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.
Opus 4.7 AI Hallucinates with High Conviction, Developer Reports
A developer reported that Anthropic's Opus 4.7 model repeatedly hallucinated a test result, insisting the score was unchanged despite evidence to the contrary. This highlights a trust issue: improved benchmark scores may not reflect real-world reliability.
Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development
Avoko AI has launched Avoko, a platform described as a behavioral lab for AI agents, providing structured environments for testing, evaluating, and improving agent performance and reliability.
Anthropic CEO Dario Amodei Predicts Coding Jobs Gone in a Year, Yet Company Hires Dozens of Engineers
Anthropic CEO Dario Amodei predicts coding jobs will disappear within a year, yet his company continues hiring engineers. The contradiction highlights the emerging role of AI oversight and tools like PlayerZero for production reliability.
K9 Audit: The Cryptographic Safety Net AI Agents Desperately Need
K9 Audit introduces a revolutionary causal audit trail system for AI agents that records not just actions but intentions, addressing critical reliability gaps in autonomous systems. By creating tamper-evident, hash-chained records of what agents were supposed to do versus what they actually did, it provides unprecedented visibility into AI decision-making failures.
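A tamper-evident, hash-chained log of intent versus action can be sketched in a few lines. This is a toy illustration of the general technique, not K9 Audit's actual format; the record fields and the choice of SHA-256 are assumptions:

```python
import hashlib
import json

def append_record(chain, intent, action):
    """Append a tamper-evident record linking an agent's stated intent
    to the action it actually performed."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"intent": intent, "action": action, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify(chain):
    """Recompute every hash; any edited record breaks the chain."""
    prev = "0" * 64
    for rec in chain:
        body = {"intent": rec["intent"], "action": rec["action"], "prev": rec["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != digest:
            return False
        prev = rec["hash"]
    return True

chain = []
append_record(chain, intent="run unit tests", action="ran pytest, 12 passed")
append_record(chain, intent="fix failing test", action="edited src/api.py")
print(verify(chain))   # True
chain[0]["action"] = "deleted src/"   # tamper with history
print(verify(chain))   # False
```

Because each record commits to its predecessor's hash, rewriting any past entry invalidates every record after it, which is what makes intent-action divergence auditable after the fact.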
The Jagged Frontier: What AI Coding Benchmarks Reveal and Conceal
New analysis of AI coding benchmarks like METR shows they capture real ability but miss key 'jagged' limitations. While performance correlates highly across tests and improves exponentially, crucial gaps in reasoning and reliability remain hard to measure.
OpenDev Paper Formalizes the Architecture for Next-Generation Terminal AI Coding Agents
A comprehensive 81-page research paper introduces OpenDev, a systematic framework for building terminal-based AI coding agents. The work details specialized model routing, dual-agent architectures, and safety controls that address reliability challenges in autonomous coding systems.
Beyond Simple Scoring: New Benchmarks and Training Methods Revolutionize AI Evaluation Systems
Researchers have developed M-JudgeBench, a capability-oriented benchmark that systematically evaluates multimodal AI judges, and Judge-MCTS, a novel data generation framework that creates stronger evaluation models. These advancements address critical reliability gaps in using AI systems to assess other AI outputs.
Claude Code Plugin Deploys 17-Agent SDLC Team With Orchestrator
Team-of-agents plugin adds 17 specialist AI agents with an orchestrator to Claude Code, using confidence signals to gate output quality.
Claude Code's Six-Layer Architecture: Harness, Not Magic
Claude Code's six-layer architecture uses a 3-layer context compressor at 92% threshold and Redis-based multi-agent FSM protocol. The model is just one node in a harness.
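Threshold-triggered compaction of the kind described can be illustrated with a toy sketch; the function name, the summary stub, and the keep-recent-turns policy are all assumptions for illustration, not Claude Code's actual compressor logic:

```python
def maybe_compact(messages, used_tokens, budget, threshold=0.92, keep_recent=4):
    """When context usage crosses the threshold, replace older turns with a
    single summary stub and keep only the most recent turns verbatim."""
    if used_tokens / budget < threshold:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Stand-in for an LLM-generated summary of the dropped turns.
    summary = {"role": "system", "content": f"[summary of {len(old)} earlier turns]"}
    return [summary] + recent
```

The 92% trigger means compaction happens before the window overflows, trading older verbatim context for headroom on the next turns.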
Claude Code Thwarts 13M RPS DDoS Attack in 10 Minutes
Claude Code autonomously stopped a 13M RPS DDoS attack on BridgeMind in 10 minutes, demonstrating AI agent capability in live infrastructure threats.
Claude Code Head Says AI Now Writes All His Production Code
Claude Code head Boris Cherny says all his production code is now AI-written, shifting his role from coder to prompt engineer over the past six months.
Codex Update Cuts GUI Workflow Latency 42%
Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.
Claude Security Public Beta Launches in Claude Code on Web
Anthropic launched Claude Security in public beta for Claude Code on web, letting developers validate and fix vulnerabilities without leaving the editor.
Claude Code Regression: How to Diagnose and Fix the Recent Quality Drop
Anthropic's postmortem reveals three regressions in Claude Code: reasoning effort, context retention, and verbosity changes. Here's how to diagnose and fix them.
Turn Claude Code Into an AI SRE
Five proven outer-loop workflows for using Claude Code as an AI SRE: incident triage, runbook execution, postmortem drafting, SLO investigation, and on-call handoffs. The bottleneck isn't the model — it's the MCP runtime.
Google Hits 75% AI-Generated Code, Up From 50% in Fall 2025
Google reports 75% of all new code is now AI-generated and engineer-approved, a sharp increase from 50% last fall. This indicates a massive, accelerating shift in software development practices at the tech giant.
10 Claude Code Skills That Actually Work: A Solo Developer's Vetted List
A curated list of the most effective Claude Code skills for developers, based on hands-on testing, focusing on practical MCP servers and workflow enhancements.
Claude Code Runs 100% Locally on Mac via Native 200-Line API Server
A developer created a 200-line server that speaks Anthropic's API natively, allowing Claude Code to run entirely locally on M-series Macs at 65 tokens/second with no cloud dependency.
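The core of such a shim is translating the Messages API request shape into a prompt the local model understands. This toy flattener assumes only the documented request fields (`system`, `messages`, and text content blocks); it is a sketch of the translation step, not the developer's actual 200-line server:

```python
def messages_to_prompt(body):
    """Flatten an Anthropic-style /v1/messages request body into a single
    prompt string for a local model (toy translation layer)."""
    parts = []
    if body.get("system"):
        parts.append(f"System: {body['system']}")
    for msg in body["messages"]:
        content = msg["content"]
        if isinstance(content, list):  # content may arrive as a list of blocks
            content = "".join(b.get("text", "") for b in content if b.get("type") == "text")
        parts.append(f"{msg['role'].capitalize()}: {content}")
    parts.append("Assistant:")
    return "\n".join(parts)

body = {"system": "Be brief.", "messages": [{"role": "user", "content": "hi"}]}
print(messages_to_prompt(body))
# System: Be brief.
# User: hi
# Assistant:
```

A real server would wrap this behind an HTTP endpoint and reshape the local model's completion into the Messages API response format, but the prompt translation above is where most of the compatibility work lives.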
Claude Code Reverse-Engineered: 98.4% of Codebase is Operational Harness
A reverse-engineering analysis of Claude Code reveals only 1.6% of its codebase is AI decision logic, with the rest being operational infrastructure. This challenges current agent design paradigms by prioritizing a robust deterministic harness over complex model routing.