agent reliability

30 articles about agent reliability in AI news

Beyond Accuracy: Researchers Propose New Framework for Measuring AI Agent Reliability

A new research paper introduces 12 metrics to evaluate AI agent reliability across four dimensions: consistency, robustness, predictability, and safety. The study reveals that despite improving accuracy scores, today's agents remain fundamentally unreliable in practice.

Feb 19, 2026 · 72% relevant

Claude Code's Source Code Leak: What It Means for Your Agent Development Today

Claude Code's source code leak exposes production-grade agent patterns developers can analyze to improve their own AI coding workflows and agent reliability.

Apr 7, 2026 · 100% relevant

NVIDIA's Nemotron-Terminal: A Systematic Pipeline for Scaling Terminal-Based AI Agents

NVIDIA researchers introduce Nemotron-Terminal, a comprehensive data engineering pipeline designed to scale terminal-based large language model agents. The system bridges the gap between raw terminal data and high-quality training datasets, addressing key challenges in agent reliability and generalization.

Mar 10, 2026 · 85% relevant

AgingBench: AI Agents Lose Reliability Over Time & Memory Fails

A UT Austin paper finds that AI agents degrade over time through memory errors, and proposes AgingBench to measure reliability decay across sessions.

May 28, 2026 · 100% relevant

AI Agents Cross the Reliability Threshold: Karpathy Declares Programming Fundamentally Transformed

Former OpenAI researcher Andrej Karpathy declares programming has become "unrecognizable" as AI agents now reliably complete complex tasks in minutes rather than days. This fundamental shift occurred in late 2025, when agents achieved unprecedented reliability through improved model quality and task persistence.

Feb 26, 2026 · 75% relevant

Google ADK Go 2.0 Adds Graph Engine, Human-in-Loop for Agents

Google released ADK Go 2.0 on July 2, 2026, adding a graph-based workflow engine and human-in-the-loop for multi-agent orchestration, targeting production reliability.

Jun 30, 2026 · 90% relevant

OpenAI Acquires Cloud Startup Ona to Power Agent Infrastructure

OpenAI acquired cloud startup Ona to support AI agent infrastructure, two days after a $6.6B raise. The deal targets enterprise reliability gaps as OpenAI pivots to B2B.

Jun 11, 2026 · 90% relevant

Stanford, Meta 'Code as Agent Harness' Paper Rethinks AI Agent Design

Stanford and Meta's "Code as Agent Harness" paper proposes code-driven AI agent orchestration, potentially improving reliability over natural language prompts.

Jun 10, 2026 · 100% relevant

Agentic Commerce: 50% of Online Transactions by 2027, Google Cloud Leads

Agents are projected to handle 50% of online transactions by 2027. Payment reliability determines winners in agentic commerce, with Google Cloud leading enterprise rollouts.

May 12, 2026 · 94% relevant

From Checkout to Trust Layer: How Merchants Can Prepare for Agentic Commerce

The article discusses the evolution of e-commerce from simple checkout processes to a future where AI shopping agents act on behalf of consumers. It argues that success in this 'agentic commerce' era depends on merchants building a robust trust layer with data security, transparency, and reliability at its core.

Apr 22, 2026 · 96% relevant

Your AI Agent Is Only as Good as Its Harness — Here’s What That Means

An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.

Apr 19, 2026 · 100% relevant

From MLOps to AgentOps: A Vision for AI Production in 2026

A forward-looking article argues that by 2026, AI systems will be complex, multi-agent software requiring a new operational discipline called 'AgentOps'. This evolution from MLOps is necessary to manage reliability, safety, and cost at scale.

Apr 18, 2026 · 82% relevant

Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development

Avoko AI announced 'Avoko,' a platform described as a behavioral lab for AI agents. It aims to provide structured environments for testing, evaluating, and improving agent performance and reliability.

Apr 16, 2026 · 89% relevant

Top AI Agent Frameworks in 2026: A Production-Ready Comparison

A comprehensive, real-world evaluation of 8 leading AI agent frameworks based on deployments across healthcare, logistics, fintech, and e-commerce. The analysis focuses on production reliability, observability, and cost predictability—critical factors for enterprise adoption.

Apr 1, 2026 · 82% relevant

The Agent Coordination Trap: Why Multi-Agent AI Systems Fail in Production

A technical analysis reveals why multi-agent AI pipelines fail unpredictably in production, with failure probability scaling exponentially with agent count. This exposes critical reliability gaps as luxury brands deploy complex AI workflows.

Mar 25, 2026 · 86% relevant
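
As a rough illustration of the compounding effect described in the item above: if each agent in a sequential pipeline succeeds independently with probability p, the whole pipeline succeeds with probability p^n, so the failure probability 1 - p^n climbs quickly with agent count n. The independence assumption, the function name, and the 0.95 figure below are all hypothetical and are not taken from the article.

```python
def pipeline_failure_probability(per_agent_success: float, agent_count: int) -> float:
    """Failure probability of a sequential pipeline of independent agents.

    Assumes every agent must succeed and that agents fail independently,
    which is a simplification and not a claim about any real system.
    """
    if not 0.0 <= per_agent_success <= 1.0:
        raise ValueError("per_agent_success must be between 0 and 1")
    if agent_count < 0:
        raise ValueError("agent_count must be non-negative")
    # The pipeline succeeds only if all agents succeed: p ** n.
    return 1.0 - per_agent_success ** agent_count


if __name__ == "__main__":
    # Hypothetical per-agent success rate of 95%.
    for n in (1, 5, 10, 20):
        print(n, round(pipeline_failure_probability(0.95, n), 3))
```

Under these assumptions, a pipeline of ten agents that are each 95% reliable fails roughly four times in ten, which is one way to read the article's point about agent count.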

New Research Reveals LLM-Based Recommender Agents Are Vulnerable to Contextual Bias

A new benchmark, BiasRecBench, demonstrates that LLMs used as recommendation agents in workflows like e-commerce are easily swayed by injected contextual biases, even when they can identify the correct choice. This exposes a critical reliability gap in high-stakes applications.

Mar 19, 2026 · 82% relevant

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

A new report details the practical challenges and emerging best practices for evaluating AI agents in real-world applications, moving beyond simple benchmarks to assess reliability, safety, and business value.

Mar 17, 2026 · 90% relevant

AgentOps: The Missing Layer That Makes Enterprise AI Safe, Reliable & Scalable

A practical architecture framework for bringing safety, governance, and reliability to enterprise AI agents, based on real deployments. This addresses the critical gap between building agents and operating them at scale in business environments.

Mar 16, 2026 · 80% relevant

Claude Code's New Tool Calling 2.0: How to Build Reliable Multi-Step Agents

Anthropic's Tool Calling 2.0 architecture fixes the reliability issues that previously made AI agents fail on complex workflows.

Mar 14, 2026 · 95% relevant

OpenAI Unveils Secure Sandbox for AI Agents with New Responses API

OpenAI has detailed its new Responses API, which runs AI agents in a secure, managed environment. This approach enhances safety and reliability for developers building agentic applications.

Mar 14, 2026 · 85% relevant

K9 Audit: The Cryptographic Safety Net AI Agents Desperately Need

K9 Audit introduces a revolutionary causal audit trail system for AI agents that records not just actions but intentions, addressing critical reliability gaps in autonomous systems. By creating tamper-evident, hash-chained records of what agents were supposed to do versus what they actually did, it provides unprecedented visibility into AI decision-making failures.

Mar 12, 2026 · 82% relevant
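
For readers unfamiliar with the idea of a tamper-evident, hash-chained record mentioned in the item above, here is a minimal sketch of the general technique using only the Python standard library. It is not K9 Audit's actual format or API: the field names (`intended`, `actual`, `prev`, `hash`) and both functions are hypothetical and chosen only for illustration.

```python
import hashlib
import json

# Hypothetical starting value for the first record's "prev" field.
GENESIS_HASH = "0" * 64


def _digest(prev_hash: str, intended: str, actual: str) -> str:
    """SHA-256 over the previous hash plus this record's content."""
    payload = json.dumps(
        {"prev": prev_hash, "intended": intended, "actual": actual},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list, intended: str, actual: str) -> dict:
    """Append a record linking intended and actual behavior to the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    record = {
        "prev": prev_hash,
        "intended": intended,
        "actual": actual,
        "hash": _digest(prev_hash, intended, actual),
    }
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Return True only if every link and every stored hash still matches."""
    prev_hash = GENESIS_HASH
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        expected = _digest(record["prev"], record["intended"], record["actual"])
        if record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True
```

Because each record's hash covers the previous record's hash, editing any earlier entry invalidates verification from that point on. A real system would presumably also need signing and secure storage, which this sketch leaves out.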

Google DeepMind's Intelligent Delegation Framework: The Missing Infrastructure for AI Agents

Google DeepMind has introduced a groundbreaking framework called Intelligent AI Delegation that enables AI agents to safely hand off tasks to other agents and humans. The system addresses critical issues of accountability, transparency, and reliability in multi-agent systems.

Mar 11, 2026 · 95% relevant

OpenDev Paper Formalizes the Architecture for Next-Generation Terminal AI Coding Agents

A comprehensive 81-page research paper introduces OpenDev, a systematic framework for building terminal-based AI coding agents. The work details specialized model routing, dual-agent architectures, and safety controls that address reliability challenges in autonomous coding systems.

Mar 8, 2026 · 95% relevant

Flowith Secures Seed Funding to Pioneer the 'Action OS' for Autonomous AI Agents

Flowith has raised multi-million dollar seed funding to develop an action-oriented operating system specifically designed for autonomous AI agents. This platform aims to address critical reliability and coordination challenges as AI agents move from experimental tools to production systems.

Mar 4, 2026 · 75% relevant

The Missing Manager: How Trace's $3M Bet Aims to Bridge the AI Agent Adoption Gap

Trace, a Y Combinator-backed startup, has raised $3 million to solve enterprise AI agent adoption by providing critical workflow context. The company positions itself as the essential 'manager' layer that orchestrates complex corporate processes, addressing reliability and scalability hurdles that have slowed widespread deployment.

Feb 26, 2026 · 70% relevant

How to Use Claude Code's Subagent Feature for Isolated Task Execution

Claude Code's new subagent feature lets you run isolated tasks in separate interpreter sessions, preventing context pollution and improving reliability.

Mar 23, 2026 · 95% relevant

CMU's Gym-Anything Turns Any Software Into Agent Training Ground

CMU's Gym-Anything automates agent environment creation, producing CUA-World with 10,000+ tasks. Even strong models fail most long tasks, showing real computer-use work is unsolved.

Jul 4, 2026 · 92% relevant

Stop Dumping Instructions Into CLAUDE.md — Use the 3-Layer Agent Harness

Stop appending rules to CLAUDE.md. Use the 3-Layer Agent Harness: a short constitution (CLAUDE.md), specialist skills, and subagents. This respects the 150-instruction compliance budget and keeps your agent reliable.

Jul 4, 2026 · 100% relevant

Sia joins €6 million investment round in agentic AI startup Lemrock

Sia joins a €6M round for agentic AI startup Lemrock. This signals enterprise demand for autonomous agents that handle complex workflows, relevant to retail automation.

Jul 3, 2026 · 68% relevant

Klaviyo launches beta for marketing AI agents

Klaviyo launched Composer and Analyst AI agents in public beta on July 1, 2026, embedding them in its CRM for real-time marketing and service use. This matters as AI agents gain traction in retail CRM, with 249 prior articles on the technology.

Jul 1, 2026 · 95% relevant
