A new paper from UC Berkeley and Stanford finds that multi-agent LLM systems score, on average, 2.3% lower on reasoning benchmarks than a single well-tuned model. The study tested five architectures across 10 benchmarks.
Key facts
- 2.3% average accuracy drop for multi-agent vs single model.
- 10 reasoning benchmarks tested, including GSM8K, MATH, and ARC-Challenge.
- 5 multi-agent architectures evaluated, including debate, reflection, and retrieval-augmented generation (RAG).
- Single GPT-4o scored 92.4% on GSM8K; best multi-agent scored 91.1%.
- Study used GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B as base models.
Most AI developers assume that throwing multiple LLM agents at a reasoning problem improves accuracy. A new paper by researchers at UC Berkeley and Stanford challenges this assumption, finding that multi-agent systems, on average, score 2.3% lower on reasoning benchmarks than a single, well-tuned model. [According to @dair_ai via @omarsar0]
The study tested five multi-agent architectures—including debate, reflection, and retrieval-augmented generation—across 10 reasoning benchmarks such as GSM8K, MATH, and ARC-Challenge. On GSM8K, a single GPT-4o scored 92.4% accuracy, while the best multi-agent variant, a two-agent debate setup, scored 91.1%. The gap widened on harder tasks: on MATH, single-model accuracy was 76.8% versus 74.2% for the multi-agent ensemble.
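The summary doesn't include the paper's implementation details, but a two-agent debate setup of the kind described usually follows a propose-critique-judge loop. The sketch below is illustrative only; the `chat` helper, prompts, and round count are hypothetical stand-ins, not the authors' code:

```python
def chat(system: str, user: str) -> str:
    """Hypothetical stand-in for a single LLM completion call."""
    raise NotImplementedError  # wire up your provider's client here

def two_agent_debate(question: str, rounds: int = 2) -> str:
    answers = ["", ""]
    for _ in range(rounds):
        for i in (0, 1):
            rival = answers[1 - i]
            prompt = question
            if rival:
                # Each agent sees its rival's latest answer and may revise.
                prompt += (
                    f"\n\nAnother agent answered:\n{rival}\n"
                    "Critique that answer, then give your own final answer."
                )
            answers[i] = chat("You are a careful reasoner.", prompt)
    # A final judge call picks between the two debated answers.
    return chat(
        "You are an impartial judge.",
        f"Question: {question}\n\nAnswer A:\n{answers[0]}\n\n"
        f"Answer B:\n{answers[1]}\n\nReturn the better final answer.",
    )
```

Note that even this minimal loop multiplies token spend: each round issues two model calls plus a judge call, which is where the coordination overhead discussed below comes from.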
The unique take here is that the paper's results contradict the prevailing narrative in AI engineering that more agents equal better reasoning. The authors note that the overhead of coordinating multiple agents (token costs, latency, and error propagation) outweighs any marginal benefit from collaboration. This echoes recent findings from Anthropic and Google DeepMind that simpler architectures often match or exceed complex pipelines.
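One way to build intuition for the error-propagation point (our back-of-the-envelope illustration, not an analysis from the paper): if each hand-off between agents preserves a correct intermediate result with some fixed probability, effective accuracy decays roughly geometrically with pipeline depth.

```python
# Back-of-the-envelope illustration (ours, not the paper's): if every
# hand-off between agents preserves a correct intermediate result with
# probability p, accuracy decays roughly geometrically with chain depth.
baseline = 0.924  # single GPT-4o on GSM8K, from the paper
p = 0.97          # assumed per-hand-off reliability (illustrative)

for stages in (1, 2, 3, 4):
    handoffs = stages - 1
    print(f"{stages} stage(s): ~{baseline * p ** handoffs:.1%}")
# Even 97%-reliable hand-offs pull a 3-stage pipeline down to ~86.9%,
# so coordination must add real signal just to break even.
```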
The researchers also controlled for model size and prompting strategy, ensuring that the single-model baseline was not artificially weak. They used identical base models (GPT-4o, Claude 3.5 Sonnet, Llama 3 70B) across all conditions. The results held across model families, suggesting the finding is general rather than model-specific.
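For readers wondering what such a controlled comparison looks like in practice, here is a minimal sketch of the evaluation-matrix idea, with hypothetical helper names; the paper's actual harness is not described in this summary:

```python
# Sketch of the controlled-comparison idea (ours, not the paper's code):
# cross each base model with each architecture while holding prompts and
# decoding settings fixed, so only the scaffolding varies.
from itertools import product

BASE_MODELS = ["gpt-4o", "claude-3.5-sonnet", "llama-3-70b"]
CONDITIONS = ["single", "debate", "reflection", "rag"]

def run_benchmark(model: str, condition: str, benchmark: str) -> float:
    """Hypothetical helper: accuracy for one (model, condition) cell."""
    raise NotImplementedError

def evaluation_matrix(benchmark: str = "gsm8k") -> dict[tuple[str, str], float]:
    # Identical prompts and temperature per model across conditions mean
    # any accuracy gap is attributable to the multi-agent setup itself.
    return {
        (model, cond): run_benchmark(model, cond, benchmark)
        for model, cond in product(BASE_MODELS, CONDITIONS)
    }
```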
One limitation: the paper did not test multi-agent systems on open-ended tasks like creative writing or code generation, where collaboration might add value. The authors call for future work on task-specific multi-agent design.
What to watch
Watch for follow-up papers that test multi-agent systems on code generation and creative tasks, where the collaboration hypothesis may still hold. Also monitor whether AI engineering teams at OpenAI and Anthropic adjust their agent product roadmaps in response.