A new benchmark paper, SocialGrid, reveals that even the largest open-source language models struggle with fundamental planning and social reasoning when deployed as autonomous agents in a multi-agent environment. The benchmark, inspired by the social deduction game Among Us, shows that the strongest tested open model, GPT-OSS-120B, achieves below 60% accuracy in both task completion and planning, with agents frequently getting stuck in repetitive loops or failing to navigate basic obstacles.
The core finding is that social reasoning remains a severe bottleneck. Even when given a "Planning Oracle" that takes navigation failures out of the picture, agents detect deception at near-chance levels, regardless of model scale. This suggests current LLM agents rely on shallow heuristics rather than accumulating and reasoning over behavioral evidence.
Key Takeaways
- Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us.
- It shows that even the strongest tested open model scores below 60% on task completion and planning, and that agents detect deception at near-chance levels.
What the Researchers Built
SocialGrid is an embodied multi-agent grid-world environment designed to evaluate LLMs on three intertwined capabilities:
- Planning & Navigation: Agents must navigate a map, avoid obstacles, and reach specific locations to complete tasks.
- Task Execution: Agents perform a sequence of actions (e.g., "fix wiring") to achieve objectives.
- Social Reasoning: In adversarial settings, some agents are designated "deceivers" who must sabotage tasks while maintaining a credible innocent facade. Other "honest" agents must complete tasks while identifying the deceivers through observation and discussion.
The environment is fully automated, providing fine-grained metrics and automatic failure analysis to help developers diagnose specific agent weaknesses.
Key Results: Planning Fails, Social Reasoning Fails Harder
The evaluation paints a stark picture of current LLM agent capabilities.

The optional Planning Oracle is a critical design feature. It provides perfect navigation and low-level action planning, allowing researchers to disentangle failures in physical planning from failures in higher-level social strategy. The results confirm that while planning is a major problem, it is not the only one: social intelligence is a distinct and significant weakness.
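To make the idea of "perfect navigation" concrete, here is a minimal sketch of what an oracle of this kind could compute: a shortest obstacle-avoiding path on a grid via breadth-first search. This is an illustration only. The paper's actual oracle implementation is not described in this article, and the grid encoding (`#` for obstacles), the function name `oracle_path`, and the 4-connected movement model are all our assumptions.

```python
from collections import deque


def oracle_path(grid, start, goal):
    """Return a shortest 4-connected path from start to goal, or None.

    Hypothetical helper: `grid` is a list of equal-length strings where
    '#' marks an obstacle (an assumed encoding, not SocialGrid's format).
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is unreachable


# A wall in column 2 forces a detour through the bottom row.
grid = [
    "..#.",
    "..#.",
    "....",
]
path = oracle_path(grid, (0, 0), (0, 3))
blocked = oracle_path([".#."], (0, 0), (0, 2))
```

With routing handed off to a helper like this, any remaining failure in an episode cannot be blamed on getting lost, which is the separation the oracle is meant to provide.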
How It Works: From Game Rules to Elo Ratings
The benchmark operationalizes social deduction into a structured simulation. Each episode involves 4-6 agents. Honest agents have a list of tasks to complete; deceivers have a list of tasks to sabotage. All agents can move, interact with objects, and communicate in a natural language chat channel.

After a timed period, agents enter a "discussion phase" where they must argue, accuse, and defend themselves before voting to eject a suspected deceiver. Performance is measured through:
- Task Completion Rate: Percentage of tasks completed by honest agents.
- Deceiver Identification Accuracy: How often honest agents correctly vote out a deceiver.
- Deceiver Survival Rate: How often deceivers avoid being ejected.
- Adversarial Elo Ratings: The researchers establish a competitive leaderboard by running leagues of agents against each other, calculating Elo scores to rank their overall strategic prowess.
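For readers unfamiliar with Elo, the sketch below shows the standard two-player Elo update. It is not taken from the paper: this article does not say which K-factor the researchers use or how team outcomes in a multi-agent episode are mapped onto pairwise results, so `k=32` and the one-versus-one framing are assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k=32 is a common default, assumed here rather than sourced.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated agents; the first one wins the episode.
new_winner, new_loser = elo_update(1500.0, 1500.0, score_a=1.0)
```

Because the expected score depends only on the rating gap, beating a much stronger opponent moves a rating far more than beating a weaker one, which is what lets a league of matches settle into a ranking.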
The automatic failure analysis categorizes common errors, such as "navigation loop," "task misunderstanding," or "failed deception bluff," providing actionable diagnostics for model developers.
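As a rough idea of how a failure category like "navigation loop" could be detected automatically, the sketch below flags a trajectory whose most recent positions keep revisiting a handful of cells. This is our own hypothetical heuristic, not the benchmark's classifier: the function name, the `window` and `max_unique` thresholds, and the position-tuple trajectory format are all assumptions.

```python
def has_navigation_loop(positions, window=8, max_unique=2):
    """Flag a trajectory whose last `window` steps touch few distinct cells.

    Hypothetical heuristic: `positions` is a list of (row, col) tuples,
    one per time step. Thresholds are illustrative, not tuned values.
    """
    if len(positions) < window:
        return False  # not enough history to judge
    recent = positions[-window:]
    return len(set(recent)) <= max_unique


# An agent bouncing between two cells in front of an obstacle.
stuck = [(1, 1), (1, 2)] * 4
# An agent making steady progress along a corridor.
moving = [(0, c) for c in range(8)]

stuck_flag = has_navigation_loop(stuck)
moving_flag = has_navigation_loop(moving)
```

A check this simple would also flag an agent that is deliberately standing still, so a real diagnostic would need to distinguish waiting from being stuck.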
Why It Matters: A Reality Check for Autonomous Agents
SocialGrid provides a much-needed reality check for the field of LLM-based autonomous agents. As noted in our recent coverage, "Your AI Agent Is Only as Good as Its Harness," the infrastructure and evaluation frameworks around agents are as important as the models themselves. This benchmark shows that simply scaling up language models does not confer robust, embodied social intelligence.

The near-random performance in deception detection suggests that current LLM agents lack a robust theory of mind: the ability to model others' beliefs, intentions, and knowledge states to predict behavior. This is a fundamental requirement for agents operating in any collaborative or competitive human environment.
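To put "near-random" in perspective, the sketch below computes the accuracy an honest agent would get by voting uniformly at random among the other players. The 4-6 agent range comes from the benchmark description. The number of deceivers per episode is not stated in this article, so the single-deceiver case shown here is an assumption, as is the simplification that every agent must vote for exactly one other player.

```python
from fractions import Fraction


def chance_vote_accuracy(n_agents, n_deceivers):
    """Probability that a uniformly random vote lands on a deceiver.

    Assumes the voter is honest, cannot vote for itself, and must vote
    for exactly one other agent (simplifications for illustration).
    """
    others = n_agents - 1
    if not 0 < n_deceivers <= others:
        raise ValueError("need between 1 and n_agents - 1 deceivers")
    return Fraction(n_deceivers, others)


# Assumed setup: one deceiver in games of 4, 5, and 6 agents.
baselines = {n: chance_vote_accuracy(n, 1) for n in (4, 5, 6)}
```

Under these assumptions the chance baseline falls between 20% and about 33%, so an agent that scores in that range is extracting essentially nothing from what it observed during the episode.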
gentic.news Analysis
This research directly intersects with several critical trends we've been tracking. First, it provides a concrete, empirical counterpoint to the speculative AGI discussions that have proliferated, such as the recent proposal for an "Artificial Scientist" AGI definition. SocialGrid shows that before achieving such lofty goals, models must first master basic multi-agent social schemas and planning.
Second, the findings on deceptive behavior align with and extend our report on research suggesting LLMs can 'lie'. SocialGrid moves beyond single-agent propensity to lie and tests the multi-agent dynamics of sustained deception and detection. The failure here is systemic.
Finally, the persistent planning failures echo concerns raised in other embodied AI research. The need for a "Planning Oracle" to even evaluate social reasoning underscores that navigation and action sequencing remain unsolved problems for LLM agents, complicating their deployment in any physical or simulated environment. This benchmark sets a clear, challenging target for the next generation of agent frameworks: models must be evaluated not just on chat, but on their integrated ability to plan, act, and reason socially in a dynamic world.
Frequently Asked Questions
What is the SocialGrid benchmark?
SocialGrid is a simulated multi-agent environment, like a simplified version of the game Among Us, designed to test Large Language Model (LLM) agents on three key skills: navigating a space and planning actions (planning), completing objectives (task execution), and figuring out who is lying in a group (social reasoning). It provides automated scoring and detailed failure reports.
How badly do current LLMs perform on SocialGrid?
Performance is poor. The best open-source model tested (GPT-OSS-120B) scored below 60% accuracy on basic task completion and planning. On the core social reasoning task of identifying a liar, agents performed at near-chance levels, and larger models did no better.
What is the "Planning Oracle" used for?
The Planning Oracle is a tool within SocialGrid that gives agents perfect navigation and low-level action control. Researchers use it to isolate failures. If an agent with perfect planning still can't identify a deceiver, the failure is definitively in social reasoning, not in getting lost or stuck. Results confirm social reasoning is a major, independent bottleneck.
Why is this benchmark important for AI development?
As companies rush to build AI agents that can operate autonomously in the real world (e.g., in customer service, games, or robotics), they need rigorous tests. SocialGrid shows that today's powerful LLMs still fail at basic integrated skills like planning and social deduction. It provides a standardized way to measure progress and diagnose specific weaknesses that need fixing before reliable, trustworthy multi-agent systems are possible.









