LLMs Score Only 22% Win Rate in Multi-Agent Clue Game, Revealing Deductive Reasoning Gaps
Researchers have implemented a text-based, multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning in large language models. The study, published on arXiv, reveals significant challenges for current LLM agents in maintaining consistent logical reasoning over extended interactions.
What the Researchers Built
The team created a fully automated, text-based environment simulating the deduction game Clue (also known as Cluedo). In this implementation, six LLM-based agents compete to solve a murder mystery by determining the correct combination of suspect, weapon, and room through strategic questioning and logical elimination.
Each game involves:
- A randomly generated solution (one suspect, one weapon, one room)
- Six AI agents, each with unique secret cards
- Turn-based gameplay where agents move between rooms, make suggestions, and must deduce the solution
- Full text-based interaction where agents receive game state information and must output valid game actions
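The setup above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the card names follow the standard Clue deck, and `deal` is a hypothetical helper that draws a random solution and splits the remaining cards into six secret hands.

```python
import random
from dataclasses import dataclass

# Standard Clue deck; the paper's exact card sets are assumed to match it.
SUSPECTS = ["Scarlett", "Mustard", "Peacock", "Plum", "Green", "White"]
WEAPONS = ["Knife", "Candlestick", "Revolver", "Rope", "Pipe", "Wrench"]
ROOMS = ["Kitchen", "Ballroom", "Library", "Study", "Hall", "Lounge",
         "Dining Room", "Billiard Room", "Conservatory"]

@dataclass
class GameSetup:
    solution: tuple  # (suspect, weapon, room) hidden from all agents
    hands: list      # one secret hand of cards per agent

def deal(num_agents: int = 6, seed: int = 0) -> GameSetup:
    rng = random.Random(seed)
    # One card of each type forms the hidden solution.
    solution = (rng.choice(SUSPECTS), rng.choice(WEAPONS), rng.choice(ROOMS))
    # The rest of the deck is shuffled and dealt round-robin.
    deck = [c for c in SUSPECTS + WEAPONS + ROOMS if c not in solution]
    rng.shuffle(deck)
    hands = [deck[i::num_agents] for i in range(num_agents)]
    return GameSetup(solution=solution, hands=hands)
```

Because no dealt hand can contain a solution card, each agent starts with a small set of certain eliminations, which is what makes the game a pure deduction exercise.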
The researchers used agents drawn from two model families: GPT-4o-mini and Gemini-2.5-Flash, representing current mid-tier commercial LLMs.
Key Results
Across 18 simulated games with varying random seeds, the LLM agents achieved only four correct wins, a success rate of approximately 22% (4/18). This low performance indicates fundamental difficulties in maintaining consistent deductive reasoning throughout a complete game session.
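An evaluation of this shape reduces to a simple harness loop. The sketch below is illustrative only: `play_game` is a hypothetical stand-in for one full simulated match that returns `True` when the accusing agent names the correct solution.

```python
# Minimal evaluation-harness sketch; `play_game` is a hypothetical callable,
# not the paper's API. Each seed deterministically regenerates one game.
def evaluate(play_game, seeds) -> float:
    wins = sum(1 for s in seeds if play_game(seed=s))
    return wins / len(seeds)

# The paper's reported aggregate: 4 correct wins over 18 seeded games.
print(f"{4 / 18:.1%}")  # → 22.2%
```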

Additionally, the researchers investigated whether fine-tuning LLMs on structured logic puzzles would transfer to improved in-game reasoning. They found that fine-tuning did not reliably improve performance and, in some cases, appeared to increase reasoning volume without improving reasoning precision.
How It Works
The Clue game environment serves as a controlled testbed for evaluating several aspects of LLM reasoning:

Multi-step deduction: Agents must track information across multiple turns, remembering which cards have been shown by other players and which combinations have been eliminated.
Strategic planning: Beyond pure deduction, agents must decide when to make suggestions (which reveals information to other players) versus when to make an accusation (which ends the game if correct).
Rule following: The environment has strict game rules that agents must adhere to, testing their ability to parse and follow structured instructions.
Multi-agent interaction: Unlike single-agent puzzle solving, this setup requires reasoning about other agents' knowledge states and potential strategies.
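The deduction component of the list above can be made concrete with a small "detective notebook" sketch. This is an assumed formulation rather than the paper's implementation: maintain the set of still-possible (suspect, weapon, room) triples, and eliminate every triple containing a card that has been seen, since a seen card cannot be part of the solution.

```python
from itertools import product

class Notebook:
    """Track candidate solutions and eliminate them as cards are revealed.
    Illustrative sketch only; the paper's agents reason in natural language."""

    def __init__(self, suspects, weapons, rooms, own_hand):
        self.candidates = set(product(suspects, weapons, rooms))
        for card in own_hand:  # an agent's own cards can't be in the solution
            self.eliminate(card)

    def eliminate(self, shown_card):
        # Drop every candidate triple that contains the revealed card.
        self.candidates = {t for t in self.candidates if shown_card not in t}

    def solved(self):
        # Exactly one candidate left means the solution is deduced.
        return next(iter(self.candidates)) if len(self.candidates) == 1 else None
```

The failure mode the study reports corresponds to agents losing track of exactly this kind of state: a single forgotten elimination across a long transcript can lead to a premature or wrong accusation.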
The researchers implemented the game using a rule-based system that validates agent actions and maintains game state. Agents receive text descriptions of the current game state and must output valid game actions in natural language.
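Rule-based validation of free-text actions typically means matching model output against a small action grammar. The command formats below are assumptions for illustration, not the paper's exact protocol:

```python
import re

# Assumed action grammar (MOVE / SUGGEST / ACCUSE); the paper's exact
# text format is not specified here.
SUGGEST = re.compile(r"^SUGGEST (\w+) WITH (\w+) IN ([\w ]+)$")
ACCUSE = re.compile(r"^ACCUSE (\w+) WITH (\w+) IN ([\w ]+)$")
MOVE = re.compile(r"^MOVE TO ([\w ]+)$")

def parse_action(text: str):
    """Return a parsed action tuple, or None if the output is invalid."""
    for kind, pattern in (("suggest", SUGGEST), ("accuse", ACCUSE), ("move", MOVE)):
        m = pattern.match(text.strip())
        if m:
            return (kind, *m.groups())
    return None  # invalid output; an engine would re-prompt or penalize
```

A validator like this is what makes the environment "strict": any output that does not parse as a legal action is rejected, so rule following is tested on every turn.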
Why It Matters
This research provides concrete evidence that current LLMs struggle with sustained, multi-step logical reasoning in interactive environments. While LLMs have shown impressive performance on many benchmarks, this study reveals specific weaknesses in:

- Long-horizon reasoning: Maintaining logical consistency over extended sequences of actions and observations
- Strategic decision-making: Balancing information gathering versus information revelation in competitive settings
- Knowledge integration: Combining new information with existing knowledge to update beliefs systematically
The finding that fine-tuning on logic puzzles doesn't reliably transfer to improved gameplay is particularly significant. It suggests that current fine-tuning approaches may teach LLMs to produce more reasoning-like text without actually improving their underlying reasoning capabilities—a form of "reasoning theater" rather than genuine logical improvement.
This work contributes to the growing body of research examining the limitations of LLMs in complex reasoning tasks, moving beyond static question-answering to dynamic, interactive environments that better simulate real-world reasoning scenarios.