LLMs Score Only 22% Win Rate in Multi-Agent Clue Game, Revealing Deductive Reasoning Gaps

Researchers created a text-based Clue game to test LLM agents' multi-step deductive reasoning. Across 18 games with GPT-4o-mini and Gemini-2.5-Flash agents, only 4 correct wins were achieved; moreover, fine-tuning on logic puzzles did not reliably improve performance.


Researchers have implemented a text-based, multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning in large language models. The study, published on arXiv, reveals significant challenges for current LLM agents in maintaining consistent logical reasoning over extended interactions.

What the Researchers Built

The team created a fully automated, text-based environment simulating the deduction game Clue (also known as Cluedo). In this implementation, six LLM-based agents compete to solve a murder mystery by determining the correct combination of suspect, weapon, and room through strategic questioning and logical elimination.

Each game involves:

  • A randomly generated solution (one suspect, one weapon, one room)
  • Six AI agents, each with unique secret cards
  • Turn-based gameplay where agents move between rooms, make suggestions, and must deduce the solution
  • Full text-based interaction where agents receive game state information and must output valid game actions
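To make this setup concrete, here is a minimal Python sketch of how such a game might deal the hidden solution and the agents' secret cards. The card names follow the standard Clue set and the `deal_game` helper is an illustrative assumption, not the paper's actual implementation:

```python
import random

# Standard Clue card set (assumed; the paper's exact card list may differ)
SUSPECTS = ["Miss Scarlett", "Colonel Mustard", "Mrs. White",
            "Mr. Green", "Mrs. Peacock", "Professor Plum"]
WEAPONS = ["Candlestick", "Knife", "Lead Pipe",
           "Revolver", "Rope", "Wrench"]
ROOMS = ["Kitchen", "Ballroom", "Conservatory", "Dining Room",
         "Billiard Room", "Library", "Lounge", "Hall", "Study"]

def deal_game(num_agents=6, seed=None):
    """Pick a hidden solution and deal the remaining 18 cards to the agents."""
    rng = random.Random(seed)
    solution = (rng.choice(SUSPECTS), rng.choice(WEAPONS), rng.choice(ROOMS))
    deck = [card for card in SUSPECTS + WEAPONS + ROOMS if card not in solution]
    rng.shuffle(deck)
    # Round-robin deal: with 6 agents, each receives 3 secret cards
    hands = [deck[i::num_agents] for i in range(num_agents)]
    return solution, hands
```

Because every non-solution card is held by some agent, perfect deduction is always possible in principle, which is what makes the game a clean reasoning benchmark.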

The researchers used agents drawn from two model families: GPT-4o-mini and Gemini-2.5-Flash, representing current mid-tier commercial LLMs.

Key Results

Across 18 simulated games with varying random seeds, the LLM agents achieved only four correct wins—a success rate of approximately 22%. This low performance indicates fundamental difficulties in maintaining consistent deductive reasoning throughout a complete game session.

Figure 3: Average correct and incorrect deductions per game by model in the Baseline and Fine-tuned experiments.

  • Total games simulated: 18
  • Correct wins: 4
  • Win rate: ~22%
  • Models tested: GPT-4o-mini, Gemini-2.5-Flash
  • Fine-tuning impact: no reliable improvement

Additionally, the researchers investigated whether fine-tuning LLMs on structured logic puzzles would transfer to improved in-game reasoning. They found that fine-tuning did not reliably improve performance and, in some cases, appeared to increase reasoning volume without improving reasoning precision.

How It Works

The Clue game environment serves as a controlled testbed for evaluating several aspects of LLM reasoning:

Figure 2: Mind Bender fine-tuning example (adapted from the original).

  1. Multi-step deduction: Agents must track information across multiple turns, remembering which cards have been shown by other players and which combinations have been eliminated.

  2. Strategic planning: Beyond pure deduction, agents must decide when to make suggestions (which reveals information to other players) versus when to make an accusation (which ends the game if correct).

  3. Rule following: The environment has strict game rules that agents must adhere to, testing their ability to parse and follow structured instructions.

  4. Multi-agent interaction: Unlike single-agent puzzle solving, this setup requires reasoning about other agents' knowledge states and potential strategies.
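The first of these capabilities, tracking which combinations have been eliminated, can be sketched as a simple possibility filter. This `DeductionTracker` class is an illustrative assumption about the logic involved, not the agents' actual internal representation (the paper's agents reason in natural language):

```python
class DeductionTracker:
    """Track which solution candidates remain, given observed evidence."""

    def __init__(self, suspects, weapons, rooms, my_hand):
        # Cards in my own hand cannot be part of the hidden solution
        self.candidates = {
            "suspect": set(suspects) - set(my_hand),
            "weapon": set(weapons) - set(my_hand),
            "room": set(rooms) - set(my_hand),
        }

    def card_shown(self, card, category):
        """A card shown by any player cannot be in the hidden solution."""
        self.candidates[category].discard(card)

    def solved(self):
        """Return the solution once every category is down to one card."""
        if all(len(v) == 1 for v in self.candidates.values()):
            return tuple(next(iter(self.candidates[k]))
                         for k in ("suspect", "weapon", "room"))
        return None
```

Maintaining exactly this kind of monotonically shrinking candidate set over many turns is what the LLM agents evidently struggle to do in free-form text.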

The researchers implemented the game using a rule-based system that validates agent actions and maintains game state. Agents receive text descriptions of the current game state and must output valid game actions in natural language.
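A validator of this kind might look like the following sketch. The action grammar shown here (`MOVE`, `SUGGEST`, `ACCUSE`) is a hypothetical stand-in for whatever protocol the paper's environment actually uses:

```python
import re

def parse_action(text, rooms):
    """Validate an agent's text output against a simple action grammar.

    Assumed forms (illustrative, not the paper's exact protocol):
      MOVE <room>
      SUGGEST <suspect>, <weapon>, <room>
      ACCUSE <suspect>, <weapon>, <room>
    Returns a parsed action dict, or None if the output is invalid.
    """
    m = re.match(r"\s*(MOVE|SUGGEST|ACCUSE)\s+(.*)", text, re.IGNORECASE)
    if not m:
        return None
    verb = m.group(1).lower()
    args = [a.strip() for a in m.group(2).split(",")]
    if verb == "move":
        return {"type": "move", "room": args[0]} if args[0] in rooms else None
    if len(args) != 3:
        return None
    return {"type": verb, "suspect": args[0], "weapon": args[1], "room": args[2]}
```

Rejecting malformed output at this layer is what lets the study attribute failures to reasoning rather than to formatting accidents.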

Why It Matters

This research provides concrete evidence that current LLMs struggle with sustained, multi-step logical reasoning in interactive environments. While LLMs have shown impressive performance on many benchmarks, this study reveals specific weaknesses in:

Figure 1: Clue gameplay diagram illustrating Player 1's turn. In this example, Player 1 made a suggestion (Mrs. White, …).

  • Long-horizon reasoning: Maintaining logical consistency over extended sequences of actions and observations
  • Strategic decision-making: Balancing information gathering versus information revelation in competitive settings
  • Knowledge integration: Combining new information with existing knowledge to update beliefs systematically

The finding that fine-tuning on logic puzzles doesn't reliably transfer to improved gameplay is particularly significant. It suggests that current fine-tuning approaches may teach LLMs to produce more reasoning-like text without actually improving their underlying reasoning capabilities—a form of "reasoning theater" rather than genuine logical improvement.

This work contributes to the growing body of research examining the limitations of LLMs in complex reasoning tasks, moving beyond static question-answering to dynamic, interactive environments that better simulate real-world reasoning scenarios.

AI Analysis

This study represents a well-designed evaluation of LLM reasoning capabilities in a controlled but realistic setting. The Clue game environment is particularly effective because it requires both deductive reasoning (eliminating possibilities based on evidence) and strategic reasoning (managing information revelation in a competitive setting).

The 22% win rate is strikingly low given that random guessing would yield approximately 1-in-324 odds (0.3%), suggesting the models are doing far better than chance but remain far from competent gameplay.

The finding that fine-tuning doesn't reliably improve performance, and can even degrade it by increasing verbosity without precision, aligns with recent concerns about "reasoning mimicry" in LLMs. Models may learn to produce reasoning-like text patterns from their training data without developing actual logical reasoning capabilities. This has implications for how we evaluate and improve LLM reasoning: we need benchmarks that distinguish between the appearance of reasoning and genuine logical capability.

Practitioners should note that this study used mid-tier models (GPT-4o-mini and Gemini-2.5-Flash), not the largest available models. It would be valuable to see how top-tier models like GPT-4o, Claude 3.5, or Gemini Ultra perform on this task. However, the fundamental challenge of maintaining consistent logical state over extended interactions likely persists even with larger models, as it relates to architectural limitations of current transformer-based LLMs rather than mere scale.
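The chance baseline quoted above follows directly from the standard Clue card counts (6 suspects, 6 weapons, 9 rooms) and can be checked with a quick calculation:

```python
# Chance baseline for a blind accusation in Clue
suspects, weapons, rooms = 6, 6, 9
total_combinations = suspects * weapons * rooms   # 324 possible solutions
chance_win = 1 / total_combinations               # ~0.31% per blind guess

# Observed performance in the study
observed_win = 4 / 18                             # ~22.2% win rate
lift = observed_win / chance_win                  # exactly 72x the chance rate
```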
Original source: arxiv.org
