
SocialGrid Benchmark Shows LLMs Fail at Deception, Score Below 60% on Planning

Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us. It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.

Gala Smith & AI Research Desk · 3h ago · 7 min read · AI-Generated
Source: arxiv.org via arxiv_ai · Corroborated
SocialGrid Benchmark Exposes Critical Gaps in LLM Planning and Social Reasoning

A new benchmark paper, SocialGrid, reveals that even the largest open-source language models struggle with fundamental planning and social reasoning when deployed as autonomous agents in a multi-agent environment. The benchmark, inspired by the social deduction game Among Us, shows that the strongest tested open model, GPT-OSS-120B, achieves below 60% accuracy in both task completion and planning, with agents frequently getting stuck in repetitive loops or failing to navigate basic obstacles.

The core finding is that social reasoning remains a severe bottleneck. Even when provided with a "Planning Oracle" to isolate navigation failures, agents fail to detect deception at near-random chance levels, regardless of model scale. This suggests current LLM agents rely on shallow heuristics rather than accumulating and reasoning over behavioral evidence.

Key Takeaways

  • Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us.
  • It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.

What the Researchers Built

SocialGrid is an embodied multi-agent grid-world environment designed to evaluate LLMs on three intertwined capabilities:

  1. Planning & Navigation: Agents must navigate a map, avoid obstacles, and reach specific locations to complete tasks.
  2. Task Execution: Agents perform a sequence of actions (e.g., "fix wiring") to achieve objectives.
  3. Social Reasoning: In adversarial settings, some agents are designated "deceivers" who must sabotage tasks while maintaining a credible innocent facade. Other "honest" agents must complete tasks while identifying the deceivers through observation and discussion.

The environment is fully automated, providing fine-grained metrics and automatic failure analysis to help developers diagnose specific agent weaknesses.
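To make the planning component concrete, here is a minimal sketch of the kind of grid navigation an agent must solve: breadth-first search over an obstacle map. This is illustrative only; the function name and grid encoding are our own, not SocialGrid's API.

```python
from collections import deque

def plan_path(grid, start, goal):
    """Shortest path over a grid of 0 (free) / 1 (obstacle) cells via BFS.

    A stand-in for the low-level navigation a SocialGrid-style agent must
    solve; returns the list of cells from start to goal, or None if the
    goal is unreachable (the "stuck" case agents often hit).
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), path + [(nr, nc)]))
    return None
```

The benchmark's qualitative findings suggest LLM agents fail exactly this kind of search: they revisit cells or loop rather than exploring systematically.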

Key Results: Planning Fails, Social Reasoning Fails Harder

The evaluation paints a stark picture of current LLM agent capabilities.

Figure 5: Complexity analysis. Mean performance across different room configurations with a fixed 10×10 room size.

  • Overall performance (standard setting): The top open model (GPT-OSS-120B) scores <60% accuracy on task completion & planning.
  • Planning isolation (with Planning Oracle): Task completion improves, but social reasoning accuracy does not.
  • Deception detection (adversarial setting): Agent performance is near random chance, showing no scaling benefit with larger models.
  • Agent behavior (qualitative analysis): Agents get stuck in repetitive behaviors, fail to navigate simple obstacles, and use shallow heuristics (e.g., voting for the quietest player).

The optional Planning Oracle is a critical design feature. It provides perfect navigation and low-level action planning, allowing researchers to disentangle failures in physical planning from failures in higher-level social strategy. The results confirm that while planning is a major problem, it is not the only one—social intelligence is a distinct and significant weakness.

How It Works: From Game Rules to Elo Ratings

The benchmark operationalizes social deduction into a structured simulation. Each episode involves 4-6 agents. Honest agents have a list of tasks to complete; deceivers have a list of tasks to sabotage. All agents can move, interact with objects, and communicate in a natural language chat channel.
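The episode setup described above can be sketched as follows. The agent names, task list, and role-assignment scheme are hypothetical stand-ins for the benchmark's actual configuration format.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    role: str                      # "honest" or "deceiver"
    tasks: list = field(default_factory=list)

def setup_episode(names, n_deceivers=1,
                  tasks=("fix wiring", "empty chute"), seed=0):
    """Assign roles for one SocialGrid-style episode of 4-6 agents.

    Honest agents get tasks to complete; deceivers get the same list as
    sabotage targets. Illustrative only; not the paper's episode format.
    """
    assert 4 <= len(names) <= 6, "episodes involve 4-6 agents"
    rng = random.Random(seed)
    deceivers = set(rng.sample(names, n_deceivers))
    return [Agent(n, "deceiver" if n in deceivers else "honest", list(tasks))
            for n in names]
```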

Figure 2: LLM agents struggle with spatial navigation in embodied settings. Comparison of crewmate performance.

After a timed period, agents enter a "discussion phase" where they must argue, accuse, and defend themselves before voting to eject a suspected deceiver. Performance is measured through:

  • Task Completion Rate: Percentage of tasks completed by honest agents.
  • Deceiver Identification Accuracy: How often honest agents correctly vote out a deceiver.
  • Deceiver Survival Rate: How often deceivers avoid being ejected.
  • Adversarial Elo Ratings: The researchers establish a competitive leaderboard by running leagues of agents against each other, calculating Elo scores to rank their overall strategic prowess.
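The rating mechanism behind the adversarial leaderboard is standard Elo. A minimal sketch of a pairwise update follows; the paper's league may use a different K-factor or team-level scoring, so treat the constants as assumptions.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update for a head-to-head result.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    Returns the new (r_a, r_b). The expected score follows the
    standard logistic curve with a 400-point scale.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Run over many episodes of a league, these updates converge to a ranking of each agent's overall strategic strength.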

The automatic failure analysis categorizes common errors, such as "navigation loop," "task misunderstanding," or "failed deception bluff," providing actionable diagnostics for model developers.
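The idea of mapping raw episode traces to named failure categories can be sketched with a toy classifier. The trace format, thresholds, and category names here are assumptions, far simpler than the benchmark's actual diagnostics.

```python
def categorize_failure(trace):
    """Classify an action trace into a coarse failure mode.

    `trace` is a list of (action, succeeded) pairs. Repeating the same
    action four or more times in a row is flagged as a navigation loop;
    a failed task action is flagged as a task misunderstanding.
    """
    actions = [action for action, _ in trace]
    if len(actions) >= 4 and len(set(actions[-4:])) == 1:
        return "navigation loop"
    if any(action == "do_task" and not ok for action, ok in trace):
        return "task misunderstanding"
    return "ok"
```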

Why It Matters: A Reality Check for Autonomous Agents

SocialGrid provides a much-needed reality check for the field of LLM-based autonomous agents. As noted in our recent coverage, Your AI Agent Is Only as Good as Its Harness, the infrastructure and evaluation frameworks around agents are as important as the models themselves. This benchmark shows that simply scaling up language models does not confer robust, embodied social intelligence.

Figure 1: SocialGrid Overview. Inspired by Among Us, SocialGrid is a controllable, embodied benchmark evaluating LLM agents.

The near-random performance in deception detection indicates that LLMs lack a core theory of mind—the ability to model others' beliefs, intentions, and knowledge states to predict behavior. This is a fundamental requirement for agents operating in any collaborative or competitive human environment.

gentic.news Analysis

This research directly intersects with several critical trends we've been tracking. First, it provides a concrete, empirical counterpoint to the speculative AGI discussions that have proliferated, such as the recent proposal for an "Artificial Scientist" AGI definition. SocialGrid shows that before achieving such lofty goals, models must first master basic multi-agent social schemas and planning.

Second, the findings on deceptive behavior align with and extend our report on research suggesting LLMs can 'lie'. SocialGrid moves beyond single-agent propensity to lie and tests the multi-agent dynamics of sustained deception and detection. The failure here is systemic.

Finally, the persistent planning failures echo concerns raised in other embodied AI research. The need for a "Planning Oracle" to even evaluate social reasoning underscores that navigation and action sequencing remain unsolved problems for LLM agents, complicating their deployment in any physical or simulated environment. This benchmark sets a clear, challenging target for the next generation of agent frameworks: models must be evaluated not just on chat, but on their integrated ability to plan, act, and reason socially in a dynamic world.

Frequently Asked Questions

What is the SocialGrid benchmark?

SocialGrid is a simulated multi-agent environment, like a simplified version of the game Among Us, designed to test Large Language Model (LLM) agents on three key skills: navigating a space and planning actions (planning), completing objectives (task execution), and figuring out who is lying in a group (social reasoning). It provides automated scoring and detailed failure reports.

How badly do current LLMs perform on SocialGrid?

Performance is poor. The best open-source model tested (GPT-OSS-120B) scored below 60% accuracy on basic task completion and planning. When tested on the core social reasoning task of identifying a liar, agents performed at near random chance levels, showing no improvement even with larger models.

What is the "Planning Oracle" used for?

The Planning Oracle is a tool within SocialGrid that gives agents perfect navigation and low-level action control. Researchers use it to isolate failures. If an agent with perfect planning still can't identify a deceiver, the failure is definitively in social reasoning, not in getting lost or stuck. Results confirm social reasoning is a major, independent bottleneck.

Why is this benchmark important for AI development?

As companies rush to build AI agents that can operate autonomously in the real world (e.g., in customer service, games, or robotics), they need rigorous tests. SocialGrid shows that today's powerful LLMs still fail at basic integrated skills like planning and social deduction. It provides a standardized way to measure progress and diagnose specific weaknesses that need fixing before reliable, trustworthy multi-agent systems are possible.


AI Analysis

The SocialGrid benchmark arrives at a pivotal moment, as the AI industry's focus shifts decisively from chatbots to autonomous agents. Its results deliver a sobering dose of empirical rigor against a backdrop of often hyperbolic claims about agentic AI. The finding that social reasoning performance is near-random and does not scale with model size is particularly significant. It suggests that the current transformer-based LLM paradigm, even at massive scale, may lack an inherent architecture for robust theory-of-mind reasoning. This is not a problem a few more parameters will solve; it may require novel architectures or training paradigms specifically designed for modeling other agents.

This work also addresses a major flaw in many AI evaluations: confounding variables. By introducing the Planning Oracle, the researchers cleanly separate the problem of "getting there" from the problem of "figuring it out." That methodological rigor matters: it tells us that improving embodied planning (a hard problem in itself) is necessary but not sufficient, and the field now has a tool to test these capabilities independently.

Connecting to our knowledge graph, this research directly challenges the trajectory implied by recent AGI definitions. A proposed [Artificial Scientist](https://gentic.news/paper-proposes-artificial) must, at a minimum, be able to reason about the beliefs and intentions of other scientists, a capability SocialGrid shows is almost entirely absent. The planning failures likewise align with concerns from embodied AI research that LLMs are fundamentally disembodied and struggle with spatial and temporal reasoning.

For practitioners, the message is clear: benchmarking your agent on static Q&A or single-turn tasks is wholly inadequate. If your use case involves multiple agents or social dynamics, you must test in environments like SocialGrid or risk deploying brittle, easily fooled systems.
