reliability

30 articles about reliability in AI news

AgingBench: AI Agents Lose Reliability Over Time & Memory Fails

UT Austin paper finds AI agents degrade over time via memory errors. Proposes AgingBench to measure reliability decay across sessions.

May 28, 2026 · 100% relevant

Building PharmaRAG: A Case Study in Proactive Reliability for RAG Systems

A developer details the architecture of PharmaRAG, a system for querying drug labels, which prioritizes a 'reliability layer' to detect unanswerable questions before any LLM generation. This approach directly tackles the critical problem of AI hallucination in high-stakes domains.

Mar 23, 2026 · 70% relevant
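
The 'reliability layer' idea described above, deciding that a question is unanswerable before any LLM call is made, can be sketched roughly as follows. This is a minimal illustration and not PharmaRAG's actual code: the `Passage` type, the `answer_with_gate` function, the `min_score` and `min_hits` thresholds, and the refusal message are all hypothetical names and values chosen for the example, and the assumption that retrieval scores lie in [0, 1] is ours.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Passage:
    """A retrieved chunk of a drug label (hypothetical structure)."""
    text: str
    score: float  # assumed retrieval similarity in [0, 1], higher is better


REFUSAL = "The indexed drug labels do not contain enough information to answer this."


def answer_with_gate(
    question: str,
    passages: Sequence[Passage],
    generate: Callable[[str, Sequence[Passage]], str],
    min_score: float = 0.75,  # assumed threshold, would need tuning on real data
    min_hits: int = 1,
) -> str:
    """Refuse before generation when retrieval support is too weak."""
    # Keep only passages whose retrieval score clears the threshold.
    supporting = [p for p in passages if p.score >= min_score]

    # Gate: with too little support, return a refusal and never call the LLM.
    if len(supporting) < min_hits:
        return REFUSAL

    # Only well-supported questions reach the generator.
    return generate(question, supporting)
```

The design choice worth noting is that `generate` is never invoked on the refusal path, so a weakly supported question cannot produce a hallucinated answer; the cost is that a threshold set too high will refuse questions the labels could have answered.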

Anthropic Survey of 80,508 Users Reveals AI's Dual Perception: Hope for Work & Growth, Fear of Unreliability & Job Loss

Anthropic's global study of 80,508 users finds people simultaneously hold hope and fear about AI. Top hopes center on work improvement and personal growth, while top concerns are unreliability, job loss, and reduced autonomy.

Mar 18, 2026 · 87% relevant

AI Agents Cross the Reliability Threshold: Karpathy Declares Programming Fundamentally Transformed

Former OpenAI researcher Andrej Karpathy declares programming has become "unrecognizable" as AI agents now reliably complete complex tasks in minutes rather than days. This fundamental shift occurred in late 2025, when agents achieved unprecedented reliability through improved model quality and task persistence.

Feb 26, 2026 · 75% relevant

Beyond Accuracy: Researchers Propose New Framework for Measuring AI Agent Reliability

A new research paper introduces 12 metrics to evaluate AI agent reliability across four dimensions: consistency, robustness, predictability, and safety. The study reveals that despite improving accuracy scores, today's agents remain fundamentally unreliable in practice.

Feb 19, 2026 · 72% relevant

The Quantization Paradox: How Compressing Multimodal AI Impacts Reliability

New research reveals that compressing multimodal AI models through quantization significantly reduces their reliability, making them more likely to produce confidently wrong answers. The study identifies methods to mitigate these effects while maintaining efficiency gains.

Feb 17, 2026 · 70% relevant

Ethan Mollick's 'AI Weirdness Axiom': Why Treating AI Like Standard IT Products Reduces Reliability

Wharton professor Ethan Mollick argues that AI's inherent 'weirdness' must be embraced, not minimized. Attempting to implement AI like conventional software leads to less useful and less reliable systems.

Mar 17, 2026 · 85% relevant

CARE Framework Exposes Critical Flaw in AI Evaluation, Offers New Path to Reliability

Researchers have identified a fundamental flaw in how AI models are evaluated, showing that current aggregation methods amplify systematic errors. Their new CARE framework explicitly models hidden confounding factors to separate true quality from bias, improving evaluation accuracy by up to 26.8%.

Mar 3, 2026 · 80% relevant

ResearchGym Exposes AI's 'Capability-Reliability Gap' in Scientific Discovery

A new benchmark called ResearchGym reveals that while frontier AI agents can occasionally achieve state-of-the-art scientific results, they fail to do so reliably. In controlled evaluations, agents completed only 26.5% of research sub-tasks on average, highlighting critical limitations in autonomous scientific discovery.

Feb 18, 2026 · 78% relevant

Google ADK Go 2.0 Adds Graph Engine, Human-in-Loop for Agents

Google released ADK Go 2.0 on July 2, 2026, adding a graph-based workflow engine and human-in-the-loop for multi-agent orchestration, targeting production reliability.

Jun 30, 2026 · 90% relevant

Caliper: Run Your Claude Code Skills k Times and Get a pass@k Score That

Caliper gives Claude Code users a pass@k reliability score for skills, with a baseline delta showing if the skill beats the base agent. Install via pipx or npx.

Jun 28, 2026 · 100% relevant
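
For readers unfamiliar with pass@k, the sketch below shows one common way to compute it: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of runs, c the number that passed, and k the sample budget. Whether Caliper uses this exact estimator is not stated in the summary, so treat this as an assumption; the `pass_at_k` and `baseline_delta` function names are hypothetical and are not Caliper's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes.

    n: total runs performed, c: runs that passed, k: sample budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")

    # Fewer than k failures exist, so any k-sized sample contains a pass.
    if n - c < k:
        return 1.0

    # 1 minus the probability that all k sampled runs are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def baseline_delta(skill_score: float, base_score: float) -> float:
    """Positive when the skill beats the base agent, negative when it hurts."""
    return skill_score - base_score
```

As a usage example, a skill that passes 6 of 10 runs scores `pass_at_k(10, 6, 1)`, about 0.6, against about 0.3 for a base agent passing 3 of 10, giving a baseline delta of roughly +0.3.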

Cline v4.0.0 Ships Plugin Marketplace

Cline v4.0.0 introduces a plugin marketplace, queued prompts, and an SDK rewrite. Claude Code users get new extensibility and reliability features.

Jun 26, 2026 · 98% relevant

OpenAI Acquires Cloud Startup Ona to Power Agent Infrastructure

OpenAI acquired cloud startup Ona to support AI agent infrastructure, two days after a $6.6B raise. The deal targets enterprise reliability gaps as OpenAI pivots to B2B.

Jun 11, 2026 · 90% relevant

Stanford, Meta 'Code as Agent Harness' Paper Rethinks AI Agent Design

Stanford and Meta's "Code as Agent Harness" paper proposes code-driven AI agent orchestration, potentially improving reliability over natural language prompts.

Jun 10, 2026 · 100% relevant

Claude Code Quality Drops Post-4.6, Users Report 25% Task Failure Rate

Claude Code quality dropped post-4.6 with ~25% instruction misses. Codex offers 95% reliability but less creativity.

Jun 3, 2026 · 90% relevant

PJM Warns AI Data Center Load Could Break Power Market Assumptions

PJM warns AI data center load could grow 5x to 25 GW by 2035, colliding with queue delays and outdated market rules. Regulators flag reliability and cost risks.

Jun 3, 2026 · 90% relevant

Claude Code Digest — May 11–May 14

Anthropic's agent misalignment fixes cut incidents by 40-60%, redefining AI reliability.

May 14, 2026 · 95% relevant

Agentic Commerce: 50% of Online Transactions by 2027, Google Cloud Leads

Agents projected to handle 50% of online transactions by 2027. Payment reliability determines winners in agentic commerce, with Google Cloud leading enterprise rollouts.

May 12, 2026 · 94% relevant

Claude Skills: Directive Descriptions Hit 100% Activation in 650-Trial Test

A 650-trial experiment found directive Claude skill descriptions achieve 100% activation vs 37% for passive phrasing. The YAML description field does 90% of the reliability work.

May 1, 2026 · 75% relevant

GPT-5.5 Pro Sustains 2-Hour Bug Fixing Sessions

A user reports GPT-5.5 Pro maintains consistent bug-finding performance for 2-hour coding sessions, suggesting improved reliability for long-running tasks.

Apr 26, 2026 · 85% relevant

From Checkout to Trust Layer: How Merchants Can Prepare for Agentic Commerce

The article discusses the evolution of e-commerce from simple checkout processes to a future where AI shopping agents act on behalf of consumers. It argues that success in this 'agentic commerce' era depends on merchants building a robust trust layer with data security, transparency, and reliability at its core.

Apr 22, 2026 · 96% relevant

Your AI Agent Is Only as Good as Its Harness — Here’s What That Means

An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.

Apr 19, 2026 · 100% relevant

Opus 4.7 AI Hallucinates with High Conviction, Developer Reports

A developer reported that Anthropic's Opus 4.7 model repeatedly hallucinated about a test result, insisting the score was unchanged despite evidence. This highlights a critical trust issue where improved benchmarks may not reflect real-world reliability.

Apr 19, 2026 · 87% relevant

From MLOps to AgentOps: A Vision for AI Production in 2026

A forward-looking article argues that by 2026, AI systems will be complex, multi-agent software requiring a new operational discipline called 'AgentOps'. This evolution from MLOps is necessary to manage reliability, safety, and cost at scale.

Apr 18, 2026 · 82% relevant

Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development

Avoko AI announced 'Avoko,' a platform described as a behavioral lab for AI agents. It aims to provide structured environments for testing, evaluating, and improving agent performance and reliability.

Apr 16, 2026 · 89% relevant

New Research Proposes Authority-aware Generative Retrieval (AuthGR) for

A new arXiv paper introduces an Authority-aware Generative Retriever (AuthGR) framework. It uses multimodal signals to score document trustworthiness and trains a model to prioritize authoritative sources. Large-scale online A/B tests on a commercial search platform report significant improvements in user engagement and reliability.

Apr 16, 2026 · 83% relevant

Correct Chains, Wrong Answers

A new benchmark called the Novel Operator Test reveals that large language models can perform every step of logical reasoning correctly yet still declare the wrong final answer. This dissociation between reasoning process and output accuracy challenges assumptions about LLM reliability for complex tasks.

Apr 16, 2026 · 74% relevant

LLM Evaluation Beyond Benchmarks

The source critiques traditional LLM benchmarks as inadequate for assessing performance in live applications. It proposes a shift toward creating continuous test suites that mirror actual user interactions and business logic to ensure reliability and safety.

Apr 14, 2026 · 72% relevant

AI-Powered Drone De-Ices Power Lines in Sub-Zero Fog

A drone system autonomously navigates thick fog and snow to de-ice high-voltage power lines. This removes the need for hazardous manual crew climbs, improving grid reliability and safety.

Apr 11, 2026 · 89% relevant

Claude Code's Source Code Leak: What It Means for Your Agent Development Today

Claude Code's source code leak exposes production-grade agent patterns developers can analyze to improve their own AI coding workflows and agent reliability.

Apr 7, 2026 · 100% relevant
