Qwen 2.5 7B outputs near-constant verbalized confidence (0.856–0.937) whether its accuracy is 49% or 75.3%. A June 2026 arXiv paper from University of Minnesota researchers reveals the model's epistemic blind spots on structured clinical data via cross-model attribution divergence with XGBoost.
Key facts
- Qwen 2.5 7B confidence range: 0.856–0.937 across 49%–75.3% accuracy.
- Accuracy drops to 64.8% when XGBoost is 99% correct.
- Few-shot + SHAP improves accuracy from 49% to 75.3%.
- Cross-model calibrator reduces ECE from 0.254 to 0.080.
- Attribution Disagreement Score drops from 1.54 to 0.38 with combined intervention.
Large language models deployed on structured clinical data cannot reliably signal when they are wrong. A new paper, "LLM Doesn't Know What It Doesn't Know" by Akshat Dasula, Prasanna Desikan, and Jaideep Srivastava, tests this through cross-model attribution divergence: it compares Qwen 2.5 7B's feature attributions against XGBoost's SHAP values on a clinical prediction task.
The Confidence Mirage
The LLM's verbalized confidence is epistemically vacuous. The paper reports that Qwen 2.5 7B outputs a near-constant range (0.856–0.937) regardless of whether accuracy is 49% or 75.3%. The confidence score tracks prompt format, not prediction quality [per the arXiv preprint].
Worse, the model exhibits an inverse difficulty effect: its accuracy drops to 64.8% on cases where XGBoost is 99% correct, but it matches XGBoost (73.8% vs. 73.1%) on cases where XGBoost is only moderately certain. Because the LLM's stated confidence barely moves, it is most overconfident on exactly the cases a traditional ML model finds easiest.
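The inverse difficulty effect is a statement about LLM accuracy stratified by how confident the reference model is. A minimal sketch of that kind of breakdown follows, assuming binary labels and a reference model that outputs a positive-class probability. The function name, the bucket edges, and the toy inputs are illustrative assumptions, not the paper's setup.

```python
def accuracy_by_reference_confidence(ref_probs, llm_correct, edges=(0.5, 0.7, 0.9)):
    """Group cases by the reference model's confidence and report LLM accuracy per group.

    ref_probs   -- reference model's predicted probability of the positive class
    llm_correct -- 1 if the LLM's prediction was correct for that case, else 0
    edges       -- lower bounds of the confidence buckets (hypothetical choice);
                   the smallest edge must be 0.5, since binary confidence is >= 0.5
    """
    if len(ref_probs) != len(llm_correct):
        raise ValueError("inputs must have the same length")
    buckets = {edge: [] for edge in edges}
    for p, ok in zip(ref_probs, llm_correct):
        # Confidence of a binary classifier: probability of its predicted class.
        conf = max(p, 1.0 - p)
        # Assign the case to the highest bucket whose lower bound it reaches.
        lower = max(e for e in edges if e <= conf)
        buckets[lower].append(ok)
    # Empty buckets report None rather than a misleading 0.0.
    return {e: (sum(v) / len(v) if v else None) for e, v in buckets.items()}


if __name__ == "__main__":
    # Toy data only, not results from the paper.
    print(accuracy_by_reference_confidence([0.99, 0.01, 0.6, 0.55], [1, 0, 1, 1]))
```

An inverse difficulty effect would show up as lower LLM accuracy in the highest-confidence bucket than in the middle ones.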
Super-Additive Interventions
Few-shot examples and SHAP-derived feature evidence are orthogonal interventions. Alone, few-shot examples yield 49% accuracy. SHAP injection alone performs similarly. Combined, and without any training, they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% [according to the paper's Table 1].
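The article does not reproduce the paper's ADS formula. To make the idea of attribution disagreement concrete, the sketch below uses a stand-in measure: the L1 distance between two attribution vectors after each is scaled to unit L1 norm. This is an assumption for illustration, not the authors' definition, and the function and feature names are hypothetical.

```python
def normalize(attr):
    """Scale a {feature: attribution} mapping so absolute values sum to 1."""
    total = sum(abs(v) for v in attr.values())
    if total == 0:
        raise ValueError("attribution vector is all zeros")
    return {k: v / total for k, v in attr.items()}


def attribution_disagreement(llm_attr, shap_attr):
    """Stand-in disagreement score (NOT the paper's ADS definition).

    Returns the L1 distance between the two normalized, signed attribution
    vectors: 0.0 when they agree exactly, 2.0 when they are fully opposed.
    Features missing from one side are treated as having zero attribution.
    """
    a = normalize(llm_attr)
    b = normalize(shap_attr)
    features = set(a) | set(b)
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in features)


if __name__ == "__main__":
    # Hypothetical feature names and values, for illustration only.
    llm = {"age": 1.0, "blood_pressure": 1.0}
    shap = {"age": 1.0}
    print(attribution_disagreement(llm, shap))
```

Under this stand-in, a score near zero means the LLM cites the same features, with the same signs and relative weights, as the SHAP values of the reference model.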
Calibration Without Internals
The paper's fourth finding is the most practically useful. A cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080. This replaces uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference.
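For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between mean confidence and accuracy in each bin, weighted by bin size. The sketch below uses that standard equal-width binning; the bin count of 10 is an assumption, since the article does not state which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width-bin ECE.

    confidences -- predicted confidence per case, each in [0, 1]
    correct     -- 1 if the prediction was right for that case, else 0
    n_bins      -- number of equal-width confidence bins (assumed value)
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("inputs must be non-empty")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of cases.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy data: constant 0.9 confidence with 50% accuracy gives a gap of about 0.4.
    print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A model that states roughly the same confidence regardless of accuracy, as described above, concentrates all cases in one or two bins, so its ECE is essentially the gap between that stated confidence and its overall accuracy.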
The authors frame the findings as a cold-start problem: on structured data, an LLM has no prior knowledge of the feature space, in contrast to the natural language it saw during pretraining. The path toward epistemic self-awareness, they argue, requires cross-model attribution signals, not better verbalized confidence.
Key Takeaways
- Qwen 2.5 7B's confidence is near-constant (0.856–0.937) across accuracy from 49% to 75.3%.
- Combining SHAP with few-shot examples cuts ADS from 1.54 to 0.38 and lifts accuracy to 75.3%.
What to watch
Watch for follow-up work that extends the cross-model calibrator to larger models like Llama 3 70B or GPT-4o on clinical benchmarks, and whether the super-additive SHAP+few-shot effect generalizes beyond tabular data to multimodal clinical inputs like radiology reports.
Source: arxiv.org








