Qwen 2.5 7B Verbalized Confidence Is Epistemically Vacuous, Paper Finds

Qwen 2.5 7B's confidence is near-constant (0.856–0.937) across accuracy from 49% to 75.3%. Combining SHAP with few-shot examples cuts ADS from 1.54 to 0.38 and lifts accuracy to 75.3%.

Source: arxiv.org (via arxiv_ai, corroborated)
Does Qwen 2.5 7B know when it doesn't know on clinical tabular data?

A June 2026 arXiv paper shows that Qwen 2.5 7B's verbalized confidence is near-constant (0.856–0.937) regardless of accuracy. The model also shows an inverse difficulty effect: its accuracy drops to 64.8% on cases where XGBoost is 99% correct. Combining SHAP evidence with few-shot examples cuts the Attribution Disagreement Score from 1.54 to 0.38.

TL;DR

  • LLM confidence scores are near-constant (0.856–0.937).
  • Accuracy drops to 64.8% when XGBoost is 99% correct.
  • SHAP + few-shot cuts ADS from 1.54 to 0.38.
  • Cross-model calibrator reduces ECE from 0.254 to 0.080.

Qwen 2.5 7B outputs near-constant verbalized confidence (0.856–0.937) whether its accuracy is 49% or 75.3%. A June 2026 arXiv paper from University of Minnesota researchers reveals the model's epistemic blind spots on structured clinical data via cross-model attribution divergence with XGBoost.

Key facts

  • Qwen 2.5 7B confidence range: 0.856–0.937 across 49%–75.3% accuracy.
  • Accuracy drops to 64.8% when XGBoost is 99% correct.
  • Few-shot + SHAP improves accuracy from 49% to 75.3%.
  • Cross-model calibrator reduces ECE from 0.254 to 0.080.
  • Attribution Disagreement Score drops from 1.54 to 0.38 with combined intervention.

Large language models deployed on structured clinical data cannot reliably signal when they are wrong. A new paper, "LLM Doesn't Know What It Doesn't Know," by Akshat Dasula, Prasanna Desikan, and Jaideep Srivastava, tests this through cross-model attribution divergence: it compares Qwen 2.5 7B's feature attributions against XGBoost's SHAP values on a clinical prediction task.
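To make the idea of attribution divergence concrete, here is a minimal sketch of one way two models' feature attributions could be compared. It is not the paper's Attribution Disagreement Score, whose exact definition this article does not give. The function names, the choice of an L1 distance over normalized absolute attributions, and the feature values are all assumptions made for illustration.

```python
def normalize(attr):
    """Turn raw attributions into a distribution over features (absolute values summing to 1)."""
    total = sum(abs(v) for v in attr.values())
    if total == 0:
        raise ValueError("attributions are all zero")
    return {name: abs(v) / total for name, v in attr.items()}


def attribution_l1_distance(llm_attr, ref_attr):
    """Hypothetical disagreement measure: L1 distance between normalized profiles.

    Returns 0.0 when both models weight features identically and 2.0 when
    they put all their weight on disjoint features.
    """
    p, q = normalize(llm_attr), normalize(ref_attr)
    features = set(p) | set(q)
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)


# Invented example values, not taken from the paper.
llm_attr = {"age": 0.6, "glucose": 0.1, "bmi": 0.3}   # what the LLM says it relied on
ref_attr = {"age": 0.1, "glucose": 0.7, "bmi": 0.2}   # reference-model SHAP magnitudes
score = attribution_l1_distance(llm_attr, ref_attr)
print(f"disagreement = {score:.2f}")
```

A per-patient score of this kind is the sort of signal a downstream calibrator could consume, since it needs only the two attribution profiles and no access to model internals.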

The Confidence Mirage

The LLM's verbalized confidence is epistemically vacuous. The paper reports that Qwen 2.5 7B outputs a near-constant range (0.856–0.937) regardless of whether accuracy is 49% or 75.3%. The confidence score tracks prompt format, not prediction quality [per the arXiv preprint].

Worse, the model exhibits an inverse difficulty effect: its accuracy drops to 64.8% on cases where XGBoost is 99% correct, yet it matches XGBoost (73.8% vs. 73.1%) on cases where XGBoost is moderately uncertain. The model is most confidently wrong exactly when the task is easiest for a traditional ML model.
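An effect like this can be checked by grouping cases by how confident the reference model is and computing the LLM's accuracy within each group. The sketch below assumes each record is a pair of the reference model's confidence and whether the LLM was correct. The bin edges and the sample records are invented for illustration and are not the paper's data.

```python
import bisect


def accuracy_by_reference_confidence(records, edges=(0.7, 0.9)):
    """Return LLM accuracy per reference-confidence bin.

    `edges` splits [0, 1] into len(edges) + 1 bins; a confidence equal to an
    edge falls into the higher bin. Empty bins are reported as None.
    """
    hits = [0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for ref_conf, llm_correct in records:
        b = bisect.bisect_right(edges, ref_conf)
        counts[b] += 1
        hits[b] += bool(llm_correct)
    return [h / c if c else None for h, c in zip(hits, counts)]


# Invented records: (reference model confidence, was the LLM correct?)
records = [
    (0.55, True), (0.65, True),                  # reference model unsure
    (0.80, True), (0.85, False),                 # reference model fairly sure
    (0.95, False), (0.99, True), (0.97, False),  # reference model very sure
]
per_bin = accuracy_by_reference_confidence(records)
print(per_bin)
```

If LLM accuracy falls as the reference model's confidence rises, as in this toy data, that is the inverse pattern described above.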

Super-Additive Interventions

Few-shot examples and SHAP-derived feature evidence act as orthogonal interventions whose effects are super-additive. Few-shot examples alone yield 49% accuracy, and SHAP injection alone performs similarly. Combined, they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and lift accuracy from 49% to 75.3%, without any training [according to the paper's Table 1].
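Because the combined intervention involves no training, it amounts to prompt construction. The sketch below shows one plausible way to place few-shot examples and reference-model SHAP evidence in a single prompt. The prompt wording, the `build_prompt` helper, the `top_k` cutoff, and all feature values are assumptions for illustration. The paper's actual prompt template is not given in this article.

```python
def format_features(features):
    """Render a feature dict as 'name=value' pairs."""
    return ", ".join(f"{name}={value}" for name, value in features.items())


def build_prompt(patient, shap_evidence, examples, top_k=3):
    """Assemble a few-shot prompt that also lists the reference model's top SHAP features."""
    lines = ["Task: predict the binary outcome (0 or 1) for the final patient.", ""]
    for i, (features, label) in enumerate(examples, start=1):
        lines.append(f"Example {i}: {format_features(features)} -> outcome={label}")
    lines.append("")
    lines.append("Reference-model evidence (SHAP, largest magnitude first):")
    # Rank features by absolute contribution and keep only the top_k.
    ranked = sorted(shap_evidence.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for name, value in ranked[:top_k]:
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"- {name}: {direction} predicted risk ({value:+.2f})")
    lines.append("")
    lines.append(f"Patient: {format_features(patient)}")
    lines.append("Answer with the outcome and a confidence between 0 and 1.")
    return "\n".join(lines)


# Invented inputs, not taken from the paper.
examples = [({"age": 71, "glucose": 180}, 1), ({"age": 34, "glucose": 95}, 0)]
patient = {"age": 58, "glucose": 160, "bmi": 31}
shap_evidence = {"age": 0.05, "glucose": 0.42, "bmi": -0.10}
prompt = build_prompt(patient, shap_evidence, examples, top_k=2)
print(prompt)
```

In a real pipeline the `shap_evidence` values would come from an explainer run on the reference model for that specific patient. Here they are hard-coded so the sketch stays self-contained.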

Calibration Without Internals

The paper's fourth finding is the most practically useful. A cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080. This replaces uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference.
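For readers unfamiliar with the metric, expected calibration error (ECE) bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin, weighted by bin size. The sketch below is a standard equal-width-bin version. The bin count and the sample data are assumptions, and the paper's exact binning scheme is not stated in this article.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |average confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(is_correct)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Invented data: a constant confidence of 0.9 on predictions that are right half the time.
constant_conf_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(f"ECE = {constant_conf_ece:.2f}")
```

The toy case shows why near-constant confidence is costly: when every prediction carries the same stated confidence, ECE collapses to the gap between that number and overall accuracy.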

The authors frame the findings as a cold start problem: an LLM applied to structured data has no prior knowledge of the feature space, whereas it learned natural language during pretraining. On this view, the path toward epistemic self-awareness runs through cross-model attribution signals, not better verbalized confidence.

What to watch

Watch for follow-up work that extends the cross-model calibrator to larger models like Llama 3 70B or GPT-4o on clinical benchmarks, and whether the super-additive SHAP+few-shot effect generalizes beyond tabular data to multimodal clinical inputs like radiology reports.



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The paper's inverse difficulty effect is the most striking finding: Qwen 2.5 7B is most confidently wrong on easy cases. This mirrors the 'overconfidence on in-distribution' pattern seen in vision models but is more dangerous in clinical settings where straightforward cases are the most common. The cross-model calibrator is a practical workaround, but it depends on having a reliable reference model, which becomes a circular dependency if the reference model itself is flawed.

The super-additive effect of SHAP + few-shot is reminiscent of retrieval-augmented generation but with a key difference: RAG retrieves text chunks, while SHAP provides structured feature-level evidence. This suggests that LLMs on tabular data need explicit feature attribution signals, not just textual context.

Missing from the paper: no ablation on the choice of reference model (why XGBoost over logistic regression or a neural network?), and no test on clinical data with missing or noisy features. The cold start framing is apt, but the paper doesn't propose a solution for the cold start, only detection.