The Limits of Crowd Wisdom: Why Polling Multiple LLMs Doesn't Guarantee Truth
A new study published on arXiv on February 20, 2026 challenges a fundamental assumption in artificial intelligence deployment: that aggregating responses from multiple large language models (LLMs) leads to more truthful outputs. The paper, titled "Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for LLM Truthfulness," finds that polling-style aggregation, a strategy that pays off in mathematics and code generation, yields no consistent accuracy gains when inference compute is scaled in domains that lack external verification.
The Failed Promise of Ensemble Methods
In domains like mathematics and programming, techniques such as Pass@k—which generate multiple candidate solutions and filter them through external verifiers—have proven remarkably effective. This success naturally led researchers to wonder if similar approaches could improve truthfulness in open-ended domains like factual question answering, where external verification is impractical.
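The verified-domain recipe described above can be sketched in a few lines of Python. Both callables here are hypothetical stand-ins: `sample` for a model drawing one candidate solution, `verify` for an external checker such as a unit-test runner.

```python
def pass_at_k(sample, verify, k):
    """Draw up to k candidates from `sample` and return the first one
    the external verifier accepts, or None if all k fail."""
    for _ in range(k):
        candidate = sample()
        if verify(candidate):
            return candidate
    return None

# Toy demo with a scripted sampler: the third draw passes the check.
draws = iter([12, 40, 7, 7])
result = pass_at_k(sample=lambda: next(draws), verify=lambda c: c == 7, k=4)
# result == 7
```

The key property is that correctness is decided by the verifier, not by how many candidates agree; without `verify`, extra samples buy nothing.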
The research team tested this hypothesis across five benchmarks using a range of language models. Surprisingly, even at 25 times the inference compute of a naive single-sample baseline, polling-style aggregation provided no consistent accuracy improvement. In many cases, it actually amplified misconceptions shared across the models.
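For contrast, here is a minimal sketch of the unverified, polling-style aggregation the study evaluates: take the plurality answer across samples. The sample strings are invented; they illustrate that the vote rewards whatever answer is most common, true or not.

```python
from collections import Counter

def poll_aggregate(answers):
    """Plurality vote over sampled answers: the most common answer wins.
    Ties break by first occurrence, as Counter.most_common guarantees."""
    return Counter(answers).most_common(1)[0][0]

# Five samples in which three models share the same answer "A".
# If "A" is a shared misconception, the poll simply amplifies it.
samples = ["A", "A", "B", "A", "B"]
consensus = poll_aggregate(samples)
# consensus == "A"
```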
Social Prediction vs. Truth Verification
The study identifies a crucial distinction in how language models operate under uncertainty. Researchers discovered that models are better at predicting what other models within an ensemble will say than at identifying what is objectively true. This reveals a fundamental separation between social prediction (anticipating consensus) and truth verification (determining factual accuracy).

"Under uncertainty, models are better at predicting what other models will say within model ensembles than at identifying what is true," the authors note, highlighting how this dynamic undermines traditional wisdom-of-crowds approaches when applied to AI systems.
The Problem of Correlated Errors
A key finding explains why aggregation fails: language model errors are strongly correlated across different models and architectures. This correlation persists even when models are conditioned on out-of-distribution random strings and asked to produce pseudo-random outputs; even then, their outputs line up.

This error correlation means that when multiple models are wrong, they tend to be wrong in similar ways, creating the illusion of consensus around incorrect information. The source of this correlation extends beyond any individual benchmark, suggesting it's a fundamental property of how current LLMs process information.
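One way to see why correlated errors break voting is to measure them directly. The sketch below computes the phi coefficient between two binary error vectors (the data is invented, not drawn from the paper); when errors co-occur on the same items, the independent-voter assumption behind crowd wisdom no longer holds.

```python
from math import sqrt

def phi_correlation(errors_a, errors_b):
    """Phi coefficient between two binary error vectors
    (1 = the model answered that item incorrectly)."""
    n = len(errors_a)
    n11 = sum(a and b for a, b in zip(errors_a, errors_b))
    n10 = sum(a and not b for a, b in zip(errors_a, errors_b))
    n01 = sum(b and not a for a, b in zip(errors_a, errors_b))
    n00 = n - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Invented error records for two models on eight questions:
# they miss mostly the same items, so their errors are far from independent.
a = [1, 1, 0, 0, 1, 0, 0, 1]
b = [1, 1, 0, 0, 0, 0, 0, 1]
phi = phi_correlation(a, b)
# phi is roughly 0.77 — strongly correlated errors
```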
Confidence-Based Weighting Offers No Solution
The researchers also examined whether weighting responses by each model's self-reported confidence could improve aggregation. It could not: self-reported confidence failed to reliably distinguish correct from incorrect answers and delivered no meaningful truthfulness gain.
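A sketch of what such confidence weighting might look like (a generic scheme, not the paper's exact method): sum each answer's self-reported confidence and pick the heaviest. The responses below are invented; if wrong answers are stated just as confidently, the weighting changes nothing.

```python
from collections import defaultdict

def confidence_weighted_vote(responses):
    """Aggregate (answer, self_reported_confidence) pairs by summing
    confidence mass per answer and returning the heaviest answer."""
    mass = defaultdict(float)
    for answer, confidence in responses:
        mass[answer] += confidence
    return max(mass, key=mass.get)

# Invented responses: the shared-but-wrong answer "A" is also stated
# with high confidence, so weighting does not rescue the vote.
responses = [("A", 0.9), ("A", 0.8), ("B", 0.95), ("A", 0.85)]
winner = confidence_weighted_vote(responses)
# winner == "A"
```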

This finding is particularly significant for real-world applications where users might naturally trust more confident-sounding responses. The disconnect between confidence and accuracy represents another layer of complexity in deploying trustworthy AI systems.
Implications for AI Development and Deployment
These results delineate clear boundaries for inference-time scaling strategies. In verified domains with external checkers, additional samples provide more candidates for filtering. In unverified domains, additional samples merely reinforce shared misconceptions without improving truthfulness.
The study suggests that improving LLM truthfulness will require fundamentally different approaches than simply scaling up inference compute through ensemble methods. Researchers may need to focus on architectural changes, training methodologies, or hybrid systems that incorporate external verification mechanisms even in traditionally unverified domains.
This research arrives at a critical moment in AI development, as frameworks like SkillsBench increasingly depend on reliable AI agents for practical applications. The findings caution against over-reliance on consensus-based approaches for truth-critical applications and highlight the need for more sophisticated verification mechanisms.
Looking Forward: Beyond Simple Aggregation
The arXiv study represents an important reality check for the AI community. While ensemble methods and inference scaling have delivered impressive results in certain domains, their limitations in truth-seeking contexts are now clearly documented. Future research will need to address the fundamental issue of correlated errors and develop methods that can genuinely distinguish between social consensus and factual accuracy.
As AI systems become increasingly integrated into decision-making processes across industries—from business intelligence to scientific research—understanding these limitations becomes crucial for developing reliable, trustworthy systems that can navigate the complex landscape of human knowledge without merely amplifying existing biases and misconceptions.


