

HORIZON Benchmark Diagnoses Long-Horizon Failures in GPT-5 and Claude Agents

A new benchmark called HORIZON systematically analyzes where and why LLM agents like GPT-5 and Claude fail on long-horizon tasks. The study collected over 3100 agent trajectories and provides a scalable method for failure attribution, offering practical guidance for building more reliable agents.

Gala Smith & AI Research Desk · 8 min read · AI-Generated
Source: arxiv.org (via arxiv_ai)

Large language model (LLM) agents excel at short, discrete tasks but consistently break down when faced with complex, multi-step challenges. A new research paper, "The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break," introduces HORIZON, a cross-domain diagnostic benchmark designed to systematically expose and analyze these critical failure points. The study evaluates state-of-the-art agents from the GPT-5 and Claude model families across four domains, collecting over 3,100 execution trajectories to map how performance degrades as task complexity increases.

What the Researchers Built: A Diagnostic Framework for Agent Failures

The core contribution is the HORIZON benchmark, a methodological framework for constructing tasks and analyzing failure behaviors in LLM-based agents. Unlike traditional benchmarks that measure only final success or failure, HORIZON is designed for diagnosis. It enables researchers to pinpoint where in a long sequence of interdependent actions an agent goes wrong and why.

The benchmark covers four representative agentic domains to ensure cross-domain insights:

  1. Web Navigation: Sequential browsing and information retrieval.
  2. Code Generation & Execution: Multi-file programming with execution feedback.
  3. Tool-Use & API Orchestration: Planning and executing actions across a toolkit.
  4. Strategic Gameplay: Tasks requiring long-term planning and adaptation.

For each domain, HORIZON provides tasks with varying "horizon lengths"—the number of sequential, dependent steps required for completion. This allows for direct analysis of performance degradation as horizon length increases.
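
To make the idea of horizon scaling concrete, here is a minimal sketch of how one might represent tasks with growing chains of interdependent steps. The class and function names are illustrative assumptions, not the paper's actual task-generation code.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One action in a task; depends_on lists indices of prior steps it requires."""
    description: str
    depends_on: list = field(default_factory=list)


@dataclass
class Task:
    domain: str
    steps: list = field(default_factory=list)

    @property
    def horizon_length(self) -> int:
        # Horizon length = number of sequential, dependent steps.
        return len(self.steps)


def extend_horizon(task: Task, new_descriptions: list) -> Task:
    """Append steps that each depend on the previous one, lengthening
    the chain of interdependent actions from a seed task."""
    steps = list(task.steps)
    for desc in new_descriptions:
        depends = [len(steps) - 1] if steps else []
        steps.append(Step(desc, depends))
    return Task(task.domain, steps)


seed = Task("web_navigation", [Step("open search page")])
longer = extend_horizon(seed, ["search for topic", "open top result", "extract table"])
print(longer.horizon_length)  # 4
```

Because each appended step depends on its predecessor, an error anywhere in the chain can invalidate everything downstream, which is exactly the property the benchmark varies.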

Key Results: Mapping the Breakdown

The researchers deployed state-of-the-art agents, including variants of the GPT-5 and Claude model families, on HORIZON, collecting 3,100+ full execution trajectories. The key finding is a clear and systematic horizon-dependent degradation pattern. Agent performance does not gradually decline; it exhibits specific breakdown points correlated with increasing task complexity and step interdependence.
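
A breakdown point, as opposed to gradual decline, shows up as a sharp drop in a success-rate curve over horizon length. The following sketch locates the largest adjacent-horizon drop; the data and the function are hypothetical illustrations, not the paper's analysis code.

```python
def breakdown_point(horizons, success_rates):
    """Return (horizon, drop) for the largest success-rate drop
    between adjacent horizon lengths."""
    drops = [
        (success_rates[i] - success_rates[i + 1], horizons[i + 1])
        for i in range(len(horizons) - 1)
    ]
    largest_drop, horizon = max(drops)
    return horizon, round(largest_drop, 2)


# Hypothetical curve: performance holds, then collapses between 20 and 30 steps.
horizons = [5, 10, 20, 30, 50]
rates = [0.92, 0.88, 0.81, 0.34, 0.21]
print(breakdown_point(horizons, rates))  # (30, 0.47)
```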

Figure 2: The HORIZON diagnostic pipeline for scalable long-horizon failure analysis.

| Finding | Detail | Why it matters |
| --- | --- | --- |
| Trajectories collected | 3,100+ across 4 domains | Large-scale, empirical basis for analysis |
| Failure attribution method | LLM-as-a-Judge pipeline (validated κ = 0.84 vs. human) | Enables scalable, reproducible root-cause analysis |
| Core failure modes | Planning errors, context loss, inefficient search, inability to recover from mistakes | Failures are not random but follow predictable patterns |
| Cross-model consistency | GPT-5 and Claude families show similar degradation trends despite architectural differences | Suggests a fundamental challenge beyond model-specific capabilities |

The study's second major contribution is a trajectory-grounded LLM-as-a-Judge pipeline for automated failure attribution. This pipeline analyzes an agent's full execution trace to categorize the root cause of failure (e.g., "incorrect step 3 led to unrecoverable state"). The method was validated against human annotators, achieving strong agreement (Cohen’s κ = 0.84 between human judges and the LLM judge). This provides a scalable tool for developers to audit their own agents' failures.
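
The κ = 0.84 figure is Cohen's kappa, a chance-corrected agreement statistic. For readers unfamiliar with it, here is a minimal pure-Python computation; the failure labels below are made-up examples, not data from the paper.

```python
def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently at their marginal rates.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


human = ["planning", "context_loss", "planning", "recovery", "search"]
judge = ["planning", "context_loss", "planning", "recovery", "context_loss"]
print(round(cohen_kappa(human, judge), 2))  # 0.72
```

A value of 0.84 on a multi-class failure taxonomy indicates the judge agrees with humans far more often than chance label overlap alone would predict.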

How It Works: From Benchmark to Diagnosis

The HORIZON methodology operates in three stages:

Figure 1: Illustration of general agent execution and failure propagation.

  1. Task Generation & Horizon Scaling: For each domain, a seed task is defined. The horizon is then systematically extended by adding interdependent steps, creating a spectrum of difficulty from short to very long horizons.
  2. Agent Execution & Trajectory Collection: Agents are deployed on these tasks. Their complete action sequences, observations, and reasoning (if available) are logged as trajectories.
  3. Failure Attribution via LLM-as-Judge: The novel pipeline uses a separate LLM configured as a judge to analyze each failed trajectory. The judge is prompted with the task goal, the agent's actions, and a taxonomy of possible failure causes (planning, execution, context management, etc.) to assign a root cause.
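
Stage 3 can be pictured as assembling a structured prompt for the judge model. The sketch below shows one plausible shape; the taxonomy labels, function name, and prompt wording are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical failure taxonomy of the kind the judge chooses from.
FAILURE_TAXONOMY = [
    "planning_error",
    "execution_error",
    "context_loss",
    "inefficient_search",
    "failed_recovery",
]


def build_judge_prompt(goal, trajectory, taxonomy=FAILURE_TAXONOMY):
    """Assemble the judge input: task goal, the full action/observation
    trace, and the allowed failure categories."""
    trace = "\n".join(
        f"step {i}: action={action!r} observation={obs!r}"
        for i, (action, obs) in enumerate(trajectory, start=1)
    )
    return (
        f"Task goal: {goal}\n\n"
        f"Agent trajectory:\n{trace}\n\n"
        "Classify the root cause of failure using exactly one of: "
        + ", ".join(taxonomy)
    )


prompt = build_judge_prompt(
    "book the cheapest direct flight",
    [("search flights", "20 results"), ("open result 7", "page timed out")],
)
```

Grounding the judge in the full trajectory, rather than only the final state, is what makes step-level root-cause attribution possible.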

> What This Means in Practice: Instead of just knowing an agent failed a 50-step task, a developer can learn that it failed at step 23 due to a cascading planning error that originated at step 10. This precise diagnosis is critical for targeted improvements.
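
The "failed at step 23, root cause at step 10" style of diagnosis amounts to walking a dependency graph backwards from the observed failure. Here is an illustrative sketch under assumed data structures (it is not the paper's attribution method, which uses an LLM judge rather than explicit validity flags):

```python
def earliest_faulty_ancestor(failed_step, depends_on, is_valid):
    """Walk dependencies backwards from the observed failure and return
    the earliest invalid step it transitively depends on (the candidate
    root cause), or None if no ancestor is flagged invalid."""
    frontier = [failed_step]
    seen = set()
    faulty = []
    while frontier:
        step = frontier.pop()
        if step in seen:
            continue
        seen.add(step)
        if not is_valid[step]:
            faulty.append(step)
        frontier.extend(depends_on.get(step, []))
    return min(faulty) if faulty else None


# Hypothetical trajectory: the agent visibly fails at step 23,
# but the plan actually went wrong at step 10.
depends_on = {23: [22], 22: [15], 15: [10], 10: [3]}
is_valid = {i: True for i in range(24)}
is_valid[23] = is_valid[15] = is_valid[10] = False
print(earliest_faulty_ancestor(23, depends_on, is_valid))  # 10
```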

The researchers have released a HORIZON Leaderboard to track agent performance and encourage community contributions to the benchmark.

Why It Matters: Moving Beyond the Mirage

The paper's title references a "mirage"—the illusion that agents capable of impressive short-horizon tasks are equally capable of long-horizon reasoning. HORIZON provides the tools to see through this mirage and confront the underlying engineering challenges.

Figure 5: HORIZON overview with two orthogonal dimensions.

This work shifts the focus from aggregate performance scores (e.g., "Agent X achieves 85% success") to structural reliability analysis. It provides a common framework for comparing agents not just on whether they succeed, but on how they fail. For enterprises betting on agentic automation for complex workflows—like multi-stage data analysis, customer support escalation, or supply chain optimization—this diagnostic capability is essential for risk assessment and system design.

The findings also contextualize parallel research. For instance, the related paper introducing the SLATE benchmark (arXiv:2604.12126) identifies similar bottlenecks in tool-augmented agents, specifically struggles with self-correction and search efficiency in large tool libraries. The A-R space behavioral measurement study (arXiv:2604.12116) complements HORIZON by analyzing the execution-layer behavior (action vs. refusal rates) of tool-enabled agents under different autonomy scaffolds. Together, these papers represent a concerted research push toward rigorous, deployment-oriented evaluation of LLM agents.

gentic.news Analysis

This research arrives at a critical inflection point for agentic AI. Throughout 2025, the narrative was dominated by demonstrations of agents performing increasingly sophisticated single-domain tasks. However, as covered in our analysis of Anthropic's Project Steiner, a significant gap remained in systematic evaluation of cross-domain, long-horizon reliability. The HORIZON benchmark directly addresses this gap, providing the community with a much-needed standardized stress test.

The choice to evaluate GPT-5 and Claude model families is strategically significant. It follows a pattern we've observed since the GPT-4o and Claude 3.5 Sonnet launches, where competitive benchmarking drives rapid iteration. By showing that horizon-dependent degradation is a consistent problem across leading model families, the researchers highlight a fundamental limitation in current agent architectures, likely tied to context window management, planning depth, and state tracking. This suggests that the next competitive frontier won't be raw capability on narrow benchmarks, but architectural innovations for sustained reasoning, as hinted at by DeepMind's Gemini 2.0 Pro reasoning enhancements.

For practitioners, the validated LLM-as-a-Judge pipeline for failure attribution may be the most immediately useful output. It operationalizes a scalable form of automated debugging for agentic systems, reducing the need for costly and slow human-in-the-loop analysis. This tool, combined with diagnostic benchmarks like HORIZON, enables a more engineering-driven, iterative development cycle for complex agents, moving the field from demo-centric to reliability-centric development.

Frequently Asked Questions

What is a "long-horizon" task for an AI agent?

A long-horizon task requires an agent to execute an extended sequence of interdependent actions where early decisions critically constrain or enable later options. Examples include debugging a complex software issue by editing multiple files, conducting multi-source research to write a detailed report, or managing a multi-step customer service ticket that escalates across departments. It's not just a long list of steps, but a chain where each step's success depends on the context and outcomes of previous steps.

How does the HORIZON benchmark differ from SWE-Bench or other coding benchmarks?

While SWE-Bench evaluates the final outcome of solving a single GitHub issue, HORIZON is designed for diagnosis. It tracks the agent's entire execution trajectory across multiple domains (not just code) to identify the precise point and cause of failure. HORIZON is about understanding the process of failure in long tasks, whereas SWE-Bench primarily measures the binary result of solving a problem.

Can the LLM-as-a-Judge failure attribution method be trusted?

The paper reports strong validation. The automated LLM judge achieved a Cohen’s kappa (κ) score of 0.84 when compared to human annotators, indicating "almost perfect" agreement according to standard interpretation scales. The agreement between human annotators themselves was lower (κ = 0.61), suggesting the LLM judge can be remarkably consistent. However, its accuracy is dependent on the quality of the failure taxonomy and judge prompts provided by the researchers.

What should developers do if their agents fail HORIZON-style tests?

The benchmark is designed to guide improvements. First, use the failure attribution pipeline to categorize the root cause (e.g., "planning error," "context loss"). For planning errors, consider implementing more structured reasoning frameworks like Chain-of-Thought or Tree-of-Thoughts. For context loss, investigate improved state management or retrieval mechanisms. The key is to move from trial-and-error tuning to targeted architectural fixes based on diagnostic evidence.
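
The "categorize, then apply a targeted fix" loop described above can be reduced to a simple routing table. The categories and remediation strings below are illustrative assumptions drawn from the article's examples, not an official mapping from the paper.

```python
# Hypothetical mapping from diagnosed failure category to a candidate fix.
REMEDIATIONS = {
    "planning_error": "adopt structured reasoning, e.g. chain- or tree-of-thought planning",
    "context_loss": "add external state tracking or retrieval over the trajectory",
    "inefficient_search": "prune the action space or add heuristics for tool selection",
    "failed_recovery": "insert checkpoints and explicit error-detection/retry steps",
}


def suggest_fix(category):
    """Map a diagnosed root cause to a targeted architectural change."""
    return REMEDIATIONS.get(
        category, "collect more trajectories before changing the architecture"
    )
```

The point is the workflow, not the table: diagnosis first, then a fix chosen for that failure mode, instead of undirected prompt tuning.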


AI Analysis

The HORIZON benchmark represents a necessary maturation of agent evaluation, shifting focus from capability demonstrations to reliability engineering. Its most significant contribution is providing a standardized, cross-domain lens for a problem that has been an open secret: agents are brittle over long sequences. By quantifying and categorizing failure modes, it creates a common language for researchers and a diagnostic toolkit for developers.

Technically, the high agreement between the LLM-as-a-Judge and human annotators (κ = 0.84) is a notable result. It validates the use of LLMs for scalable, fine-grained trajectory analysis, a method that could be productized into agent monitoring and debugging suites. This aligns with the industry trend towards observability and evaluation platforms for AI systems, as seen in the rise of companies like Weights & Biases and Arize AI expanding into LLM ops.

The paper's context within a cluster of related arXiv submissions on the same day—covering behavioral measurement (A-R space) and tool-use benchmarks (SLATE)—suggests a coordinated push from a major research institution to redefine the evaluation paradigm for agents. This isn't an incremental benchmark update; it's a foundational effort to establish rigorous, multi-dimensional assessment criteria for a technology moving rapidly toward production. For the field, the immediate implication is that new agent frameworks and models will need to report HORIZON diagnostics alongside traditional accuracy metrics to be taken seriously for complex applications.

