
Multi-User LLM Agents Struggle: Gemini 3 Pro Scores 85.6% on Muses-Bench
AI Research · Score: 92


A new benchmark reveals LLMs struggle with multi-user scenarios where agents face conflicting instructions. Gemini 3 Pro leads but only achieves 85.6% average, with privacy-utility tradeoffs proving particularly difficult.

Gala Smith & AI Research Desk · 6h ago · 7 min read · AI-Generated
Multi-User LLM Agents Fail at Coordination: Gemini 3 Pro Scores 85.6% on New Muses-Bench

Current LLM agent frameworks are built for a single boss. Deploy them into a real team—with multiple stakeholders, conflicting goals, and private information—and they fall apart. New research formalizes this multi-user interaction problem and introduces Muses-Bench, a benchmark testing three critical scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination.

The results are sobering. Even the top-performing model, Google's Gemini 3 Pro, averages just 85.6% across tasks. On the complex meeting coordination scenario—where an agent must synthesize inputs from multiple users with varying authority—no tested model exceeds a 64.8% success rate. The privacy-utility tradeoff is especially brutal: models like xAI's Grok-3-Mini score near-perfect on privacy (99.6%) but tank on utility (60.1%).

What the Benchmark Tests

Muses-Bench formalizes the multi-principal decision problem for AI agents. Instead of a single instruction stream, agents receive requests from multiple users with different levels of authority, private information they shouldn't share, and potentially conflicting goals. The benchmark evaluates three core scenarios:

  1. Instruction Following with Authority Conflicts: The agent receives contradictory instructions from users with defined hierarchical authority (e.g., a manager vs. a colleague).
  2. Cross-User Access Control: The agent must manage and respect information privacy boundaries between users, deciding what to share and with whom.
  3. Multi-User Meeting Coordination: The agent acts as a coordinator in a simulated meeting, integrating inputs, resolving conflicts, and maintaining action items based on user roles and authority.
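A scenario of this kind can be sketched as a simple data structure — a hypothetical illustration of the setup (user names, authority levels, and private facts are invented for the example), not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    authority: int          # higher value = more authority (e.g. manager > colleague)
    private_info: set[str] = field(default_factory=set)  # facts only this user may see

@dataclass
class Message:
    sender: str
    text: str

@dataclass
class Scenario:
    users: dict[str, User]
    history: list[Message]

# A toy authority-conflict scenario: the manager's instruction should win.
manager = User("dana", authority=2, private_info={"Q3 budget: $1.2M"})
colleague = User("sam", authority=1)
scenario = Scenario(
    users={u.name: u for u in (manager, colleague)},
    history=[
        Message("dana", "Schedule the review for Friday."),
        Message("sam", "Actually, move it to Monday."),
    ],
)
```

The agent's job is to act on `history` while respecting both the authority ordering and each user's `private_info` boundary.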

Key Results: Models Aren't Ready for Team Deployment

The paper tests several leading models, revealing significant gaps. Performance is measured across overall average, meeting-coordination success, privacy adherence, and utility.

| Model | Avg. | Meeting Coordination | Privacy | Utility |
|---|---|---|---|---|
| Gemini 3 Pro | 85.6% | 64.8% | 95.2% | 88.3% |
| GPT-4o | 83.1% | 61.7% | 93.8% | 85.1% |
| Claude 3.5 Sonnet | 81.9% | 59.4% | 91.5% | 82.0% |
| Grok-3-Mini | 79.8% | 58.1% | 99.6% | 60.1% |
| Llama 3.1 405B | 77.4% | 55.9% | 90.1% | 78.9% |

The Privacy-Utility Wall: The results for Grok-3-Mini highlight a fundamental tension. By being overly conservative with private data, it achieves near-perfect privacy scores but fails to provide useful outputs, creating a functionally useless agent. Other models leak more private information over multi-turn interactions as context accumulates.

Authority Blindness: Models consistently struggle to stably prioritize instructions based on user authority over extended conversations, often flip-flopping or defaulting to the most recent input regardless of source.
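The correct behavior — prioritizing by authority rather than recency — is trivial to state explicitly, which underlines that the failure is one of reasoning, not specification. A minimal sketch (the function and data are illustrative, not from the paper):

```python
def resolve_instruction(history, authority):
    """Return the instruction from the highest-authority sender;
    ties go to the most recent message (>= keeps later equal-rank messages)."""
    best_rank, best_text = -1, None
    for sender, text in history:
        if authority[sender] >= best_rank:
            best_rank, best_text = authority[sender], text
    return best_text

authority = {"manager": 2, "colleague": 1}
history = [
    ("manager", "Ship the report on Friday."),
    ("colleague", "Actually, ship it Monday."),  # more recent, but lower authority
]
print(resolve_instruction(history, authority))  # → Ship the report on Friday.
```

The benchmark shows that models often behave like this function with the authority check removed — defaulting to whichever instruction arrived last.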

Why This Matters for Real Deployment

The research underscores a critical mismatch between AI agent development and real-world use. Most frameworks (AutoGPT, LangChain, CrewAI) assume a single user. However, the target deployment environments—organizational Slack/Discord bots, shared project management tools, customer service triage systems, and internal copilots—are inherently multi-user.

As lead researcher Omar Sarhan notes, "As agents move into organizational tools... multi-principal conflicts become the default, not the exception. Current models aren't ready."

The failure modes are practical and risky: agents leaking confidential salary data between colleagues, a junior employee's request incorrectly overriding a director's priority, or a meeting coordinator bot failing to synthesize action items, rendering it a costly distraction.

How It Works: Formalizing the Multi-Principal Problem

The research team approached the challenge by framing it as a multi-principal decision problem from economic theory, where an "agent" (the LLM) serves multiple "principals" (users). Each user has a private type (information), a utility function (goal), and an authority level. The LLM agent's policy must map the history of messages from all users to an action (response/decision).
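In this framing, one natural (toy) agent policy is to choose the action that maximizes the authority-weighted sum of the principals' utilities. The following sketch illustrates the objective only — the actions, weights, and utility functions are invented, and the paper's actual formalization may differ:

```python
# Toy multi-principal decision: each principal i has authority weight w_i and a
# utility function u_i(action); the agent picks the action maximizing the
# authority-weighted welfare sum(w_i * u_i(a)).

def best_action(actions, principals):
    return max(actions, key=lambda a: sum(w * u(a) for w, u in principals))

actions = ["share_budget", "withhold_budget"]
principals = [
    (2.0, lambda a: 1.0 if a == "withhold_budget" else 0.0),  # manager: keep it private
    (1.0, lambda a: 1.0 if a == "share_budget" else 0.0),     # colleague: wants the data
]
print(best_action(actions, principals))  # → withhold_budget
```

The interesting cases are exactly those where the principals' utilities pull in opposite directions, as here: the manager's higher authority tips the decision toward withholding.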

Muses-Bench implements this by generating thousands of multi-turn conversations with programmed user profiles, ground-truth private data, and authority graphs. The evaluation automatically checks for correctness of task execution, privacy violations (leaking user A's private info to user B), and overall utility of the agent's output.
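The privacy check in such an evaluation can be as simple as scanning each response for another user's ground-truth private facts. This is a minimal sketch of that idea, assuming verbatim string matching (a real evaluator would likely also catch paraphrases):

```python
def privacy_violations(response, recipient, private_info):
    """Flag any private fact that belongs to another user but
    appears verbatim in the response shown to `recipient`."""
    leaks = []
    for owner, facts in private_info.items():
        if owner == recipient:
            continue  # users may always see their own information
        leaks.extend(f for f in facts if f in response)
    return leaks

private_info = {
    "alice": {"salary: $145k"},
    "bob": set(),
}
resp = "Sure Bob — Alice's salary: $145k, as you asked."
print(privacy_violations(resp, "bob", private_info))  # → ['salary: $145k']
```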

The finding that performance degrades over multiple turns suggests models lack a persistent, internal representation of the multi-user state, treating each interaction more like an isolated Q&A.

gentic.news Analysis

This research directly addresses a growing pain point in enterprise AI adoption that we first highlighted in our 2025 coverage of Microsoft's Team Copilot rollout. At launch, we noted that the shift from a personal Copilot to a shared team agent introduced uncharted behavioral and security questions. This benchmark provides the first rigorous methodology to quantify those exact problems, validating early adopter concerns with hard data.

The poor performance on meeting coordination (<65% for all models) is particularly telling. It aligns with persistent user feedback about existing AI "meeting scribe" tools from vendors like Zoom and Otter.ai, which summarize well but fail at synthesis, action item attribution, and conflict resolution—the very tasks that require multi-user understanding. This gap creates a market opportunity for startups like Sembly and Fellow.app, which are increasingly layering LLM features atop human-in-the-loop workflows.

The stark privacy-utility tradeoff exposed here has immediate implications for the AI governance and compliance sector. Companies like Credo AI and Lakera are building tools to enforce guardrails, but this research shows the problem is not just about adding filters; it's a core reasoning failure. An agent that cannot reason about why information is private in a specific multi-user context will either be useless or unsafe. This suggests future solutions may require novel architectures that explicitly model user relationships and information boundaries, moving beyond simple prompt engineering or post-hoc filtering.

Frequently Asked Questions

What is a multi-principal problem for AI agents?

It's a scenario where a single AI agent receives instructions and information from multiple users (principals) with potentially conflicting goals, different levels of authority, and private data that shouldn't be shared between them. The agent must navigate these conflicts to serve the collective or authorized interest, a challenge most current LLM-based agents are not designed to handle.

Which AI model performed best on the Muses-Bench?

Google's Gemini 3 Pro achieved the highest average score of 85.6% across all tasks in the benchmark. However, its performance on the complex meeting coordination scenario was still relatively low at 64.8%, indicating that even the best current model struggles significantly with multi-user synthesis and conflict resolution.

Why is the privacy-utility tradeoff so difficult for LLM agents?

LLMs are typically trained to be helpful and informative, which can conflict with the need to withhold private information. In a multi-user setting, an agent must dynamically understand contextual privacy—why a piece of data is private to User A relative to User B. Without explicit training or architectural mechanisms for this, models tend to fail in one of two ways: being overly conservative (high privacy, low utility) or inadvertently leaking information while trying to be helpful.

How can developers build better multi-user agents?

The research suggests that simply using a more capable base LLM is insufficient. Developers likely need to implement explicit frameworks that: 1) Formally model user authority and relationships, 2) Maintain a persistent state of information permissions across conversation turns, and 3) Incorporate training or fine-tuning on multi-principal decision-making datasets like Muses-Bench. Architectural changes, such as separate modules for user modeling and privacy reasoning, may be necessary beyond prompt engineering.
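The second recommendation — persistent permission state across turns — can be sketched as a small store of explicit disclosure grants, consulted before every response rather than re-derived from raw chat history. This is a hypothetical illustration of the design, not an API from the paper:

```python
class PermissionState:
    """Persist who may see what across turns, instead of re-deriving
    permissions from raw chat history on every request."""

    def __init__(self):
        self.grants = set()  # (owner, viewer, fact) triples

    def grant(self, owner, viewer, fact):
        self.grants.add((owner, viewer, fact))

    def may_see(self, viewer, owner, fact):
        # A user always sees their own facts; others need an explicit grant.
        return viewer == owner or (owner, viewer, fact) in self.grants

state = PermissionState()
state.grant("alice", "bob", "project deadline")
print(state.may_see("bob", "alice", "project deadline"))  # True — explicitly granted
print(state.may_see("bob", "alice", "salary"))            # False — never granted
```

Keeping this state outside the prompt sidesteps the degradation the benchmark observes as context builds over multiple turns.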


AI Analysis

This work is a critical reality check for the AI agent ecosystem. For the past two years, development has focused on single-user capabilities: coding assistants, personal writing copilots, and solo research agents. Muses-Bench formally identifies the next major hurdle — social intelligence and contextual governance — that must be cleared for agents to move from personal productivity tools to organizational infrastructure. The sub-65% scores on meeting coordination aren't just a low number; they indicate a fundamental lack of reasoning about social hierarchy and conflict resolution, capabilities that are innate to human team members.

The benchmark's design cleverly isolates specific failure modes. The privacy-utility divergence shown by Grok-3-Mini versus other models isn't merely a performance difference; it represents two catastrophic failure states for deployment. An agent that is 99.6% private but 60.1% useful is a cost center that provides no ROI. An agent that is 88% useful but leaks private data 5-10% of the time is a legal and reputational liability. This creates a clear R&D target: models need to climb the **Pareto frontier** of this trade-off, not just optimize for one side.

Practitioners should view this as a mandatory evaluation step before deploying any shared agent. The days of testing an agent with a single persona are over. The roadmap now involves stress-testing agents with simulated multi-user scenarios that include authority conflicts and private data. This research also suggests a shift in fine-tuning strategy. Instead of just tuning for correctness on single-user tasks, teams will need to curate datasets of multi-party dialogues with annotated authority, privacy flags, and optimal resolutions — a complex and costly data challenge that may define the next phase of enterprise AI.