GPT-5.4 scored 51.90% on PlanBench-XL, dropping to 11.36% in the hardest setting. The benchmark, with 327 tasks and 1,665 tools, tests LLM agents on long-horizon planning with large tool libraries.
Key facts
- 327 tasks in PlanBench-XL benchmark
- 1,665 tools in the tool library
- GPT-5.4: 51.90% accuracy in the standard setting
- GPT-5.4: 11.36% accuracy in the hardest (blocked) setting
- arXiv ID: 2606.22388
A new benchmark, PlanBench-XL, exposes a fundamental weakness in LLM agents: they struggle to plan effectively when faced with large, messy tool libraries. The paper, "PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems" (arXiv:2606.22388), builds a retail-focused benchmark with 327 tasks and 1,665 tools. Agents must uncover hidden intermediate facts before they can answer, which simulates real-world scenarios where the tool library is too large to view at once.
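When a library cannot be shown in full, an agent has to query it for candidate tools. The following is a minimal sketch of such a lookup, assuming simple keyword overlap between a query and tool descriptions. The function `search_tools`, the tool names, and the descriptions are all hypothetical illustrations and are not taken from the paper.

```python
def search_tools(query, library, k=3):
    """Rank tools by how many query words appear in their description."""
    words = set(query.lower().split())
    scored = []
    for name, description in library.items():
        overlap = len(words & set(description.lower().split()))
        if overlap:  # ignore tools that share no words with the query
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical four-tool retail library; the real one has 1,665 tools.
LIBRARY = {
    "lookup_order": "find the items in a customer order",
    "get_store": "find the store that shipped an order",
    "store_price": "price of an item at a given store",
    "refund_policy": "return and refund rules for a store",
}

print(search_tools("price of item at store", LIBRARY, k=2))
```

With only the top `k` results visible per query, the agent sees a small slice of the library at each step, which is why a poorly worded query can hide the tool it actually needs.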
The Benchmark Design
PlanBench-XL introduces two key challenges: agents must search for useful tools while solving tasks, and they must handle broken or misleading tools that force them to abandon promising paths. Rather than handing agents a clear tool path, the benchmark makes them plan both forward from what they know and backward from what they need. The hardest "blocked" setting removes direct tool access, requiring agents to infer tool utility through exploration.
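The idea of planning backward from a needed fact while routing around broken tools can be illustrated with a toy backward-chaining planner. This is a sketch under stated assumptions, not the paper's method: the `Tool` class, the `broken` flag (known up front here, whereas a real agent would only discover failures by calling the tool), and every tool and fact name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    needs: frozenset      # facts required before the tool can run
    gives: str            # fact the tool produces
    broken: bool = False  # hypothetical flag: the tool fails when called


def plan(goal, known, tools, _visiting=frozenset()):
    """Return an ordered list of tool names deriving `goal`, or None."""
    if goal in known:
        return []
    if goal in _visiting:  # already trying to derive this fact: cycle
        return None
    visiting = _visiting | {goal}
    for tool in tools:
        if tool.gives != goal or tool.broken:
            continue  # wrong output, or a dead end to route around
        steps, have, ok = [], set(known), True
        for need in sorted(tool.needs):
            sub = plan(need, have, tools, visiting)
            if sub is None:
                ok = False  # this tool cannot be satisfied; try the next
                break
            steps += sub
            have.add(need)
        if ok:
            # Drop repeated tool calls while preserving order.
            return list(dict.fromkeys(steps + [tool.name]))
    return None


TOOLS = [
    Tool("lookup_order", frozenset({"order_id"}), "sku"),
    Tool("fast_price", frozenset({"sku"}), "price", broken=True),
    Tool("get_store", frozenset({"order_id"}), "store_id"),
    Tool("store_price", frozenset({"sku", "store_id"}), "price"),
]

print(plan("price", {"order_id"}, TOOLS))
```

In this toy library the shortest route to `price` goes through the broken `fast_price` tool, so the planner falls back to the longer route that first uncovers the intermediate facts `sku` and `store_id`.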
Performance Collapse
Even strong models struggle. GPT-5.4 achieved 51.90% accuracy in the standard setting but dropped to 11.36% in the hardest blocked setting, a fall of 40.54 percentage points. The pattern is broadly consistent with concerns raised by earlier tool-use benchmarks such as ToolBench and API-Bank about how agents cope as tool sets grow. The paper does not disclose results for other models, so the evidence rests on a single model, but the gap suggests that current agents may lack robust search and planning mechanisms for large tool ecosystems.
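The size of the gap can be worked out from the two reported accuracies alone. The short calculation below uses only the 51.90% and 11.36% figures; the helper name `accuracy_drop` is an illustrative assumption.

```python
def accuracy_drop(standard, blocked):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    drop_pp = standard - blocked
    relative = drop_pp / standard * 100
    return drop_pp, relative


drop_pp, relative = accuracy_drop(51.90, 11.36)
print(f"{drop_pp:.2f} points, {relative:.1f}% relative")
# prints "40.54 points, 78.1% relative"
```

In other words, the blocked setting removes roughly four fifths of the accuracy the model reaches in the standard setting.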
Why This Matters
Real-world AI agents in customer service, DevOps, or scientific research can face tool libraries numbering in the thousands. The PlanBench-XL results suggest that simply scaling model size or context window is unlikely to be sufficient, and that agents need architectural changes that integrate search, planning, and error recovery. The benchmark's emphasis on combining forward and backward planning is a step in that direction, but the 11.36% score shows how far the field is from production-ready tool-use agents.
What to watch
Watch for follow-up work on agent architectures that integrate search and planning, and for whether model providers like OpenAI or Anthropic publish benchmark results for their own tool-use systems. The paper's code release could spur a new evaluation standard for agent planning.








