AI's Bullshit Problem: New Benchmark Reveals Models Stagnating on Factual Accuracy
A new version of a critical AI benchmark has revealed troubling stagnation in how language models handle factual accuracy. BullshitBench v2, recently released by researcher Peter Gostev and highlighted by AI commentator Kimi (kimmonismus), shows that most major AI models are not improving at avoiding what researchers call "bullshit"—confidently stated false information that sounds plausible.
What BullshitBench Actually Measures
BullshitBench isn't about measuring creative writing or coding ability. Instead, it focuses specifically on a model's tendency to generate factual inaccuracies while maintaining high confidence. The benchmark presents models with questions designed to test three behaviors (a rough scoring sketch follows the list):
- Acknowledge when they don't know something
- Provide accurate information when they do know
- Avoid generating plausible-sounding falsehoods
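To make those criteria concrete, here is a minimal, hypothetical scoring sketch. It does not reflect BullshitBench's actual methodology; the category names, the abstention phrases, and the naive string match are illustrative assumptions standing in for whatever grading scheme the benchmark really uses.

```python
# Hypothetical grading sketch for a bullshit-style benchmark.
# Assumption: each item has a reference answer, or None when the honest
# response is "I don't know".

def score_answer(model_answer: str, reference: str | None) -> str:
    """Bucket one model answer into an illustrative category."""
    abstained = any(
        phrase in model_answer.lower()
        for phrase in ("i don't know", "i'm not sure", "cannot verify")
    )
    if reference is None:
        # Unanswerable item: abstaining is the desired behavior,
        # a confident answer counts as bullshit.
        return "honest_abstention" if abstained else "bullshit"
    if abstained:
        return "missed_known_fact"
    # A naive substring check stands in for human or model-based grading.
    return "correct" if reference.lower() in model_answer.lower() else "bullshit"
```

Aggregating those buckets over a question set yields a single rate of confident falsehoods that can be tracked across model releases.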
The term "bullshit" in this context comes from philosopher Harry Frankfurt's conceptualization: statements made without regard for truth, distinct from outright lies (which intentionally deceive). For AI systems, this manifests as models generating authoritative-sounding responses that contain factual errors, often because the model is optimized to produce fluent, plausible text rather than to verify facts.
The Concerning Results
According to the benchmark results shared by Gostev, most models show little to no improvement from one release to the next on this metric. The notable exception appears to be Anthropic's Claude, which has demonstrated measurable progress in reducing bullshit generation.
This stagnation is particularly concerning given the rapid improvements the same models are posting on benchmarks for other capabilities. While they keep getting better at coding, creative writing, and reasoning tasks, their tendency to generate confident falsehoods appears more resistant to improvement.
Why This Matters for AI Safety and Deployment
The implications extend far beyond academic interest. As AI systems are increasingly deployed in educational, medical, legal, and journalistic contexts, their tendency to generate plausible falsehoods represents a significant safety concern.
Real-world consequences include:
- Students receiving incorrect information presented as fact
- Medical or legal advice containing dangerous inaccuracies
- Misinformation spreading through AI-assisted content creation
- Erosion of trust in AI systems generally
The Technical Challenge of Reducing Bullshit
Reducing bullshit generation presents a distinct technical challenge. Better training data improves raw factual accuracy, but curbing bullshit also requires changing how models handle uncertainty and how well their confidence is calibrated.
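Calibration has a concrete, measurable meaning here. A common metric is expected calibration error (ECE): bin answers by the model's stated confidence, then compare each bin's average confidence to its actual accuracy. The sketch below assumes a per-answer confidence score is already available, which is itself a nontrivial assumption for production LLMs.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted gap between stated confidence and observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the bin's share of samples
    return ece

# A model that is ~90% "sure" of wrong answers is badly calibrated:
print(expected_calibration_error([0.95, 0.9, 0.85, 0.6], [1, 0, 0, 1]))
```

A perfectly calibrated model would score zero; a model that asserts wrong answers at high confidence, as in the example, scores poorly.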
Current approaches include:
- Improved uncertainty quantification: Teaching models to better recognize when they're uncertain (a sampling-based sketch follows this list)
- Retrieval-augmented generation: Grounding responses in verified external sources
- Constitutional AI techniques: Building in principles that prioritize truthfulness
- Better calibration: Aligning confidence levels with actual accuracy
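One simple way to operationalize uncertainty quantification is self-consistency: sample the model several times and treat disagreement as a signal to abstain. The sketch below is a toy illustration; `sample_model` is a placeholder for whatever call produces one answer, and the threshold is an arbitrary value that would need tuning.

```python
from collections import Counter

ABSTAIN_THRESHOLD = 0.7  # illustrative value; tune on held-out questions

def answer_or_abstain(question, sample_model, n_samples=5):
    """Answer only when repeated samples agree; otherwise decline."""
    samples = [sample_model(question) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / n_samples  # crude confidence proxy
    if agreement < ABSTAIN_THRESHOLD:
        return "I'm not confident enough to answer that reliably."
    return answer
```

Agreement across samples is only a rough proxy for truth, and repeated sampling adds inference cost, so this illustrates the idea rather than a deployable fix.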
Claude's apparent progress suggests some of these approaches may be working, but the general stagnation across other models indicates the problem remains fundamentally difficult.
Industry Response and Future Directions
The AI industry has generally prioritized capabilities over safety metrics like bullshit reduction. Benchmarks that measure harmful outputs or factual inaccuracies often receive less attention than those measuring positive capabilities.
However, as regulatory pressure increases and real-world deployments encounter problems, metrics like BullshitBench may gain prominence. The European Union's AI Act, for instance, includes specific requirements for transparency about AI limitations, which would require better bullshit detection and avoidance.
Future developments to watch include:
- Whether other model developers prioritize bullshit reduction
- How retrieval-augmented approaches affect benchmark results
- Whether synthetic data training exacerbates or alleviates the problem
- Regulatory requirements for truthfulness metrics
The Philosophical Dimension
Beyond the technical challenges lies a philosophical question: Can language models ever truly "care" about truth? As Frankfurt argued, bullshit is distinct from lying because it disregards truth altogether rather than opposing it. Current AI systems, which optimize for statistical patterns in text rather than truth-seeking, may be inherently prone to bullshit generation.
This raises fundamental questions about whether truthfulness can be engineered into systems that don't have genuine understanding or intentionality—or whether different approaches (like hybrid human-AI systems or fundamentally different architectures) will be necessary.
Source: Benchmark results shared by Peter Gostev via Kimi (kimmonismus) on X/Twitter



