Current LLM agent frameworks are built for a single boss. Deploy them into a real team—with multiple stakeholders, conflicting goals, and private information—and they fall apart. New research formalizes this multi-user interaction problem and introduces Muses-Bench, a benchmark testing three critical scenarios: instruction following under authority conflicts, cross-user access control, and multi-user meeting coordination.
The results are sobering. Even the top-performing model, Google's Gemini 3 Pro, averages just 85.6% across tasks. On the complex meeting coordination scenario—where an agent must synthesize inputs from multiple users with varying authority—no tested model exceeds a 64.8% success rate. The privacy-utility tradeoff is especially brutal: models like xAI's Grok-3-Mini score near-perfect on privacy (99.6%) but tank on utility (60.1%).
What the Benchmark Tests
Muses-Bench formalizes the multi-principal decision problem for AI agents. Instead of a single instruction stream, agents receive requests from multiple users with different levels of authority, private information they shouldn't share, and potentially conflicting goals. The benchmark evaluates three core scenarios:
- Instruction Following with Authority Conflicts: The agent receives contradictory instructions from users with defined hierarchical authority (e.g., a manager vs. a colleague).
- Cross-User Access Control: The agent must manage and respect information privacy boundaries between users, deciding what to share and with whom.
- Multi-User Meeting Coordination: The agent acts as a coordinator in a simulated meeting, integrating inputs, resolving conflicts, and maintaining action items based on user roles and authority.
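The first scenario reduces to a rule the models fail to apply consistently: higher authority wins, with recency only as a tie-breaker. A minimal sketch of that resolution logic (the dataclass and authority levels are illustrative assumptions, not artifacts of Muses-Bench itself):

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    user: str
    authority: int  # higher = more authoritative (illustrative scale)
    text: str

def resolve(history: list[Instruction]) -> Instruction:
    """Follow the highest-authority instruction; on equal
    authority, prefer the most recent message."""
    return max(enumerate(history),
               key=lambda pair: (pair[1].authority, pair[0]))[1]

history = [
    Instruction("manager", 2, "Ship the report Friday."),
    Instruction("colleague", 1, "Actually, delay the report."),
]
# The correct behavior keeps the manager's instruction even though
# the colleague's message arrived later.
```

The benchmark's "authority blindness" finding is exactly the failure to apply this rule: models tend to weight recency over the authority term.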
Key Results: Models Aren't Ready for Team Deployment
The paper tests several leading models, revealing significant gaps. Performance is measured across correctness, privacy adherence, and utility.
| Model | Average | Meeting Coord. | Privacy | Utility |
| --- | --- | --- | --- | --- |
| Gemini 3 Pro | 85.6% | 64.8% | 95.2% | 88.3% |
| GPT-4o | 83.1% | 61.7% | 93.8% | 85.1% |
| Claude 3.5 Sonnet | 81.9% | 59.4% | 91.5% | 82.0% |
| Grok-3-Mini | 79.8% | 58.1% | 99.6% | 60.1% |
| Llama 3.1 405B | 77.4% | 55.9% | 90.1% | 78.9% |

The Privacy-Utility Wall: The results for Grok-3-Mini highlight a fundamental tension. By being overly conservative with private data, it achieves near-perfect privacy scores but fails to produce useful outputs, making it a functionally useless agent. Other models show the opposite failure, leaking more private information over multi-turn interactions as context accumulates.
Authority Blindness: Models consistently fail to prioritize instructions by user authority over extended conversations, often flip-flopping or simply deferring to the most recent input regardless of its source.
Why This Matters for Real Deployment
The research underscores a critical mismatch between AI agent development and real-world use. Most frameworks (AutoGPT, LangChain, CrewAI) assume a single user. However, the target deployment environments—organizational Slack/Discord bots, shared project management tools, customer service triage systems, and internal copilots—are inherently multi-user.
As lead researcher Omar Sarhan notes, "As agents move into organizational tools... multi-principal conflicts become the default, not the exception. Current models aren't ready."
The failure modes are practical and risky: agents leaking confidential salary data between colleagues, a junior employee's request incorrectly overriding a director's priority, or a meeting coordinator bot failing to synthesize action items, rendering it a costly distraction.
How It Works: Formalizing the Multi-Principal Problem
The research team approached the challenge by framing it as a multi-principal decision problem from economic theory, where an "agent" (the LLM) serves multiple "principals" (users). Each user has a private type (information), a utility function (goal), and an authority level. The LLM agent's policy must map the history of messages from all users to an action (response/decision).
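Read as code, that setup might look like the following minimal sketch (field names are my own shorthand for the paper's formal objects, and the baseline policy is deliberately naive):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Principal:
    name: str
    authority: int     # position in the authority hierarchy
    private_info: dict # the user's private "type"
    goal: str          # stand-in for a utility function

@dataclass
class Message:
    sender: str
    text: str

# The agent's policy maps the full multi-user message history,
# plus what it knows about the principals, to a single action.
Policy = Callable[[list[Message], dict[str, Principal]], str]

def last_speaker_policy(history: list[Message],
                        principals: dict[str, Principal]) -> str:
    """Naive baseline: obey whoever spoke last, ignoring authority.
    This is the 'authority blindness' failure mode in miniature."""
    return history[-1].text
```

A policy that never consults `principals` cannot respect authority or privacy, which is roughly what the benchmark observes in practice.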
Muses-Bench implements this by generating thousands of multi-turn conversations with programmed user profiles, ground-truth private data, and authority graphs. The evaluation automatically checks for correctness of task execution, privacy violations (leaking user A's private info to user B), and overall utility of the agent's output.
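The privacy check in particular is mechanical once ground truth exists: scan the agent's reply to user B for any private fact belonging to another user. A rough sketch of such a checker (substring matching is an assumption on my part; the benchmark's actual detector is not described in this level of detail):

```python
def find_privacy_violations(reply_to: str,
                            reply_text: str,
                            private_facts: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (owner, fact) pairs leaked to `reply_to`.
    `private_facts` maps each user to facts only they may see."""
    leaks = []
    for owner, facts in private_facts.items():
        if owner == reply_to:
            continue  # a user may always see their own data
        for fact in facts:
            if fact.lower() in reply_text.lower():
                leaks.append((owner, fact))
    return leaks
```

Real detectors would need paraphrase-robust matching, but even this naive version shows why ground-truth private data per user makes the evaluation automatic.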
The finding that performance degrades over multiple turns suggests models lack a persistent, internal representation of the multi-user state, treating each interaction more like an isolated Q&A.
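One way to supply that missing state is an explicit ledger the agent consults before every disclosure, rather than relying on the context window. This is a hypothetical design sketch, not a mechanism from the paper:

```python
class PermissionLedger:
    """Tracks which facts each user may see, persisting across turns."""

    def __init__(self):
        self._allowed: dict[str, set[str]] = {}  # user -> visible fact ids

    def grant(self, user: str, fact_id: str) -> None:
        self._allowed.setdefault(user, set()).add(fact_id)

    def can_see(self, user: str, fact_id: str) -> bool:
        return fact_id in self._allowed.get(user, set())

    def filter_reply(self, user: str, facts: dict[str, str]) -> list[str]:
        """Keep only the fact texts the recipient is cleared for."""
        return [text for fid, text in facts.items() if self.can_see(user, fid)]

ledger = PermissionLedger()
ledger.grant("alice", "q3_budget")
facts = {"q3_budget": "Q3 budget is $2M", "alice_salary": "Alice earns $90k"}
```

Because the ledger lives outside the conversation, its guarantees do not degrade as context builds, unlike the in-context behavior the benchmark measures.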
gentic.news Analysis
This research directly addresses a growing pain point in enterprise AI adoption that we first highlighted in our 2025 coverage of Microsoft's Team Copilot rollout. At launch, we noted that the shift from a personal Copilot to a shared team agent introduced uncharted behavioral and security questions. This benchmark provides the first rigorous methodology to quantify those exact problems, validating early adopter concerns with hard data.
The poor performance on meeting coordination (<65% for all models) is particularly telling. It aligns with persistent user feedback about existing AI "meeting scribe" tools from vendors like Zoom and Otter.ai, which summarize well but fail at synthesis, action item attribution, and conflict resolution—the very tasks that require multi-user understanding. This gap creates a market opportunity for startups like Sembly and Fellow.app, which are increasingly layering LLM features atop human-in-the-loop workflows.
The stark privacy-utility tradeoff exposed here has immediate implications for the AI governance and compliance sector. Companies like Credo AI and Lakera are building tools to enforce guardrails, but this research shows the problem is not just about adding filters; it's a core reasoning failure. An agent that cannot reason about why information is private in a specific multi-user context will either be useless or unsafe. This suggests future solutions may require novel architectures that explicitly model user relationships and information boundaries, moving beyond simple prompt engineering or post-hoc filtering.
Frequently Asked Questions
What is a multi-principal problem for AI agents?
It's a scenario where a single AI agent receives instructions and information from multiple users (principals) with potentially conflicting goals, different levels of authority, and private data that shouldn't be shared between them. The agent must navigate these conflicts to serve the collective or authorized interest, a challenge most current LLM-based agents are not designed to handle.
Which AI model performed best on the Muses-Bench?
Google's Gemini 3 Pro achieved the highest average score of 85.6% across all tasks in the benchmark. However, its performance on the complex meeting coordination scenario was still relatively low at 64.8%, indicating that even the best current model struggles significantly with multi-user synthesis and conflict resolution.
Why is the privacy-utility tradeoff so difficult for LLM agents?
LLMs are typically trained to be helpful and informative, which can conflict with the need to withhold private information. In a multi-user setting, an agent must dynamically understand contextual privacy—why a piece of data is private to User A relative to User B. Without explicit training or architectural mechanisms for this, models tend to fail in one of two ways: being overly conservative (high privacy, low utility) or inadvertently leaking information while trying to be helpful.
How can developers build better multi-user agents?
The research suggests that simply using a more capable base LLM is insufficient. Developers likely need to implement explicit frameworks that: 1) Formally model user authority and relationships, 2) Maintain a persistent state of information permissions across conversation turns, and 3) Incorporate training or fine-tuning on multi-principal decision-making datasets like Muses-Bench. Architectural changes, such as separate modules for user modeling and privacy reasoning, may be necessary beyond prompt engineering.