ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks

Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.

gentic.news Editorial·7h ago·7 min read·via arxiv_ai, arxiv_cv

A new benchmark called ItinBench reveals that large language models (LLMs) perform inconsistently when required to handle multiple cognitive dimensions simultaneously, such as verbal reasoning and spatial optimization within a single planning task. Published on arXiv, the study evaluates models including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family, finding that none maintain high performance across all integrated sub-tasks.

The research addresses a critical gap in LLM evaluation: most benchmarks test isolated skills like coding or math in controlled settings, while real-world agentic tasks—like planning a trip—require fluid integration of different reasoning types. ItinBench embeds a spatial route optimization problem within a traditional verbal travel itinerary planning framework, forcing models to juggle constraints, preferences, and geographic logic at once.

What the Researchers Built

The ItinBench dataset constructs multi-day trip planning scenarios for various cities. Each scenario presents the model with:

  1. Verbal Reasoning Tasks: User preferences (budget, dietary restrictions, activity types), temporal constraints, and narrative descriptions of points of interest.
  2. Spatial Reasoning Task: A Traveling Salesman Problem (TSP)-inspired requirement to optimize the daily route between selected points of interest to minimize travel time or distance.
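The spatial sub-task is essentially a small Traveling Salesman instance over the day's stops. A brute-force sketch (with made-up POI names and coordinates; this is not ItinBench's released code) shows what an optimal daily route looks like:

```python
import math
from itertools import permutations

# Illustrative sketch of the spatial sub-task only; coordinates and POI
# names are made up, and this is not the benchmark's released code.
pois = [("hotel", 0.0, 0.0), ("museum", 1.0, 2.0),
        ("market", 3.0, 1.0), ("temple", 2.0, 3.0)]

def route_length(order):
    """Total straight-line travel distance visiting POIs in the given order."""
    return sum(math.dist(order[i][1:], order[i + 1][1:])
               for i in range(len(order) - 1))

def optimal_route(pois):
    """Brute-force optimum, feasible only for a handful of daily stops."""
    start, rest = pois[0], pois[1:]
    return min(((start,) + p for p in permutations(rest)), key=route_length)

best = optimal_route(pois)
print([name for name, _, _ in best], round(route_length(best), 2))
# → ['hotel', 'museum', 'temple', 'market'] 5.89
```

Brute force works at this scale because a day rarely has more than a handful of stops; the hard part for an LLM is producing such an ordering while simultaneously honoring the verbal constraints.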

The benchmark's key innovation is that these tasks are interdependent. A model cannot simply list attractions; it must create a feasible schedule where the sequence of visits follows a geographically sensible path. Performance is measured across both dimensions: adherence to verbal constraints and the optimality of the spatial route.

Key Results

The evaluation shows a significant drop in performance when models are evaluated on the integrated task versus individual components.

| Model | Combined Score | Verbal Reasoning | Spatial Reasoning |
| --- | --- | --- | --- |
| GPT-4o | ~48% | High | Low |
| Gemini 1.5 Pro | ~45% | Moderate | Low |
| Claude 3.5 Sonnet | ~42% | High | Very Low |
| Llama 3.1 70B | ~35% | Moderate | Low |
| Mistral Large | ~38% | Moderate | Low |

Note: Exact percentages are illustrative based on the paper's described trend; the paper states overall scores were below 50% and models showed "inconsistent performance."

The paper reports that no model achieved a score above 50% on the combined evaluation metric. Performance was highly inconsistent: a model might excel at verbal reasoning but fail completely at route optimization, or vice-versa. Furthermore, model scale did not reliably predict better integrated performance.

How It Works

ItinBench scenarios are structured as conversational prompts. The model is given a user's request (e.g., "Plan a 3-day food tour in Tokyo with a focus on sushi, but I'm allergic to shellfish") and access to a database of points of interest with attributes like location, category, price, and opening hours.
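A scenario of this shape might look like the following record. The field names here are illustrative assumptions, not the released dataset schema:

```python
# Hypothetical shape of a scenario record; field names are illustrative
# assumptions, not the released ItinBench dataset schema.
scenario = {
    "request": ("Plan a 3-day food tour in Tokyo with a focus on sushi, "
                "but I'm allergic to shellfish"),
    "constraints": {"days": 3, "focus": "sushi", "exclude": ["shellfish"]},
    "pois": [
        {"name": "Sushi Dai", "category": "sushi", "price": 3,
         "location": (35.6655, 139.7708), "open_hours": (5, 14),
         "serves": ["sushi", "shellfish"]},
        {"name": "Midori Sushi", "category": "sushi", "price": 2,
         "location": (35.6595, 139.7005), "open_hours": (11, 22),
         "serves": ["sushi"]},
    ],
}

# One verbal-constraint check a planner must get right before routing:
excluded = set(scenario["constraints"]["exclude"])
safe = [p["name"] for p in scenario["pois"]
        if not excluded & set(p["serves"])]
print(safe)  # → ['Midori Sushi']
```

The interdependence the paper stresses begins here: the constraint filter shrinks the POI set, which in turn changes which route orderings are even available.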

The evaluation uses two primary metrics:

  1. Constraint Satisfaction Score: Measures how well the final itinerary meets the user's stated verbal preferences and constraints.
  2. Route Optimality Score: Calculates the efficiency of the daily travel route compared to a known optimal solution (or a strong algorithmic baseline).

The final ItinBench Score is a weighted combination of these, penalizing models that sacrifice one dimension for the other. The dataset and evaluation code are publicly released to facilitate further research.
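A scoring sketch in the spirit of these two metrics might look like the following. The exact formulas and the weight are assumptions for illustration, not the released evaluation code:

```python
# Illustrative scoring in the spirit of the paper's two metrics; the
# exact formulas and the weight are assumptions, not the released code.
def constraint_satisfaction(met: int, total: int) -> float:
    """Fraction of the user's verbal constraints the itinerary satisfies."""
    return met / total

def route_optimality(planned_len: float, optimal_len: float) -> float:
    """Optimal route length over planned length; 1.0 means the route is optimal."""
    return min(1.0, optimal_len / planned_len)

def itinbench_score(css: float, ros: float, w: float = 0.5) -> float:
    """Weighted combination of the two dimensions (the weight is assumed)."""
    return w * css + (1 - w) * ros

css = constraint_satisfaction(met=4, total=5)              # 4 of 5 constraints met
ros = route_optimality(planned_len=12.0, optimal_len=9.0)  # route a third longer than optimal
print(round(itinbench_score(css, ros), 3))  # → 0.775
```

Because both terms enter the combined score, a model that produces a narratively perfect but geographically wasteful itinerary (or vice versa) cannot score well overall.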

Why It Matters

This benchmark shifts the focus from component capability to integrated competency. It provides a more realistic stress test for models intended to function as autonomous agents. The results suggest that current LLMs, even the most advanced, are not robust integrative planners. They tend to be specialists in one type of reasoning and struggle to dynamically apply multiple cognitive skills in a coordinated way.

Figure 1: An overview of ItinBench. One of the four tasks, “Tool Use with Route Optimization,” is chosen in this figure.

For developers building AI agents for real-world applications—from travel assistants to logistics planners—this indicates that simply using a powerful off-the-shelf LLM as the core reasoning engine is insufficient. Additional architectural components, like dedicated spatial reasoners or explicit constraint solvers, are likely necessary.

gentic.news Analysis

ItinBench is a timely and necessary correction to the prevailing trend in LLM evaluation. The field has become adept at creating ever-more-difficult puzzles in narrow domains (e.g., MATH, GPQA), but has paid less attention to how these capabilities compose. This paper demonstrates that composition is not free; high scores on MMLU or GSM8K do not guarantee an LLM can effectively weave that knowledge into a coherent plan requiring spatial reasoning.

Technically, the most revealing finding is the lack of correlation between model scale/verbal ability and integrated performance. This suggests that the ability to integrate cognitive dimensions may be a distinct capability, not an emergent property of scaling alone. It points to a potential need for novel training objectives that explicitly reward multi-dimensional planning, perhaps through reinforcement learning on simulated environments that require such integration.

From an industry perspective, this research validates the architectural choices of leading AI agent frameworks (like LangChain or LlamaIndex), which often use LLMs as orchestrators calling upon specialized tools (e.g., a calculator for math, a search API for facts, a maps API for routes). ItinBench shows why this tool-use paradigm is currently essential: the base LLM lacks the native, reliable ability to perform the spatial optimization itself. The benchmark could serve as a valuable testbed for evaluating these multi-component agentic systems versus monolithic LLM approaches.

Frequently Asked Questions

What is the ItinBench benchmark?

ItinBench is a new benchmark designed to evaluate large language models on integrated planning tasks. It combines traditional verbal reasoning (understanding user preferences and constraints for a trip) with a spatial reasoning task (optimizing the travel route between destinations). Its purpose is to test how well LLMs can handle multiple cognitive dimensions simultaneously, which is crucial for real-world AI agent applications.

Which AI models were tested on ItinBench?

The paper evaluated several leading large language models, including OpenAI's GPT family (likely GPT-4 and GPT-4o), Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.1 (8B and 70B parameters), and Mistral AI's Mistral Large. The results showed that all models struggled, with none scoring above 50% on the combined evaluation metric.

Why do LLMs perform poorly on ItinBench?

LLMs perform poorly because the task requires the simultaneous and coordinated application of different types of reasoning—verbal/logistical and spatial/geometric. While LLMs are trained on vast amounts of text and may have encountered descriptions of routes or travel plans, they lack an inherent, robust model of spatial relationships and optimization. They are primarily next-token predictors excelling at pattern matching within language, not dedicated planning engines. When forced to do both, their performance on one or both dimensions degrades significantly.

How can developers build AI agents that pass benchmarks like ItinBench?

To build agents that succeed on integrated planning tasks, developers will likely need to move beyond using a single, monolithic LLM. The effective approach is a tool-augmented or neuro-symbolic architecture. In this setup, an LLM acts as a high-level orchestrator: it understands the user's verbal request, breaks down the problem, and then calls specialized tools or APIs—like a mapping service for route optimization, a calendar for scheduling, and a database for information lookup—to execute specific sub-tasks. The LLM's role is to integrate the results from these tools into a coherent final plan.
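The orchestrator pattern described above can be sketched minimally as follows. The tool registry and the rule-based "planner" are stand-ins for a real LLM and real APIs; all names here are illustrative assumptions:

```python
# Minimal sketch of the tool-augmented orchestrator pattern. The tool
# registry and rule-based planner stand in for a real LLM and real
# mapping/scheduling APIs; all names are illustrative assumptions.
TOOLS = {
    # Verbal side: drop POIs in excluded categories (e.g., allergens).
    "filter_pois": lambda pois, exclude:
        [p for p in pois if p["category"] not in exclude],
    # Spatial side: a real tool would call a maps/TSP service; here we
    # simply order stops along a 1-D "street" coordinate.
    "order_route": lambda pois: sorted(pois, key=lambda p: p["location"]),
}

def plan_trip(pois, exclude):
    """Orchestrator: delegate each sub-task to a tool, then assemble the plan."""
    candidates = TOOLS["filter_pois"](pois, exclude)  # verbal constraints
    route = TOOLS["order_route"](candidates)          # spatial ordering
    return [p["name"] for p in route]

pois = [
    {"name": "Fish Market", "category": "shellfish", "location": 2},
    {"name": "Sushi Bar", "category": "sushi", "location": 3},
    {"name": "Tea House", "category": "cafe", "location": 1},
]
print(plan_trip(pois, exclude={"shellfish"}))  # → ['Tea House', 'Sushi Bar']
```

The design point is the division of labor: the orchestrator never computes distances itself, it only decides which tool to invoke and how to compose their outputs, which is exactly the role the FAQ answer assigns to the LLM.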

AI Analysis

The ItinBench paper is significant because it operationalizes a critique that has been lingering in the AI community: our benchmarks are too siloed. By designing a task that is simple for a human travel agent but complex for an LLM, the researchers have created a compelling demonstration of a fundamental limitation. The failure mode is instructive; models often produce itineraries that are narratively coherent and meet user constraints but are geographically nonsensical, or they generate efficient routes that violate key preferences. This indicates the models are not performing true integration but are rather switching between cognitive modes poorly.

For practitioners, this underscores the importance of evaluation strategy. Relying solely on aggregate scores from popular benchmarks provides a misleading picture of a model's readiness for agentic deployment. Teams should develop or seek out integrated benchmarks relevant to their specific use case. Furthermore, the results suggest that fine-tuning on domain-specific planning data might yield better returns than simply waiting for the next generation of foundational models. A model fine-tuned on high-quality, multi-constraint planning conversations might develop more robust integrative abilities than a general-purpose model orders of magnitude larger.

Looking forward, ItinBench represents a class of benchmarks we should expect to see more of: multi-modal in the cognitive sense, if not the sensory sense. The next step would be to add even more dimensions, such as simple numerical budget optimization or social reasoning about group dynamics. The ultimate goal is a benchmark that feels less like a puzzle and more like a realistic job interview for an AI assistant, which is precisely the direction the field needs.
Original source: arxiv.org
