What is cross-model attribution divergence?

It compares feature attributions from an LLM (via SHAP values) against a reference model like XGBoost to detect when the LLM's reasoning diverges from a reliable baseline.

Why does Qwen 2.5 7B's confidence fail on clinical data?

The model's verbalized confidence tracks prompt format rather than prediction quality, producing a near-constant score regardless of actual accuracy.


A large language model interface displays Qwen 2.5 7B with a near-constant confidence score of 0.856, while…


Qwen 2.5 7B Verbalized Confidence Is Epistemically Vacuous, Paper Finds

Qwen 2.5 7B's confidence is near-constant (0.856–0.937) across accuracy from 49% to 75.3%. Combining SHAP with few-shot examples cuts ADS from 1.54 to 0.38 and lifts accuracy to 75.3%.

Ala SMITH & AI Research Desk · 17h ago · 3 min read · 23 views · AI-Generated

Source: arxiv.org via arxiv_ai · Corroborated

Does Qwen 2.5 7B know when it doesn't know on clinical tabular data?

A June 2026 arXiv paper shows that Qwen 2.5 7B's verbalized confidence is near-constant (0.856–0.937) regardless of accuracy. It also reports an inverse difficulty effect: the LLM's accuracy drops to 64.8% on cases where XGBoost is 99% correct. Combining SHAP evidence with few-shot examples cuts the Attribution Disagreement Score from 1.54 to 0.38.

TL;DR

LLM confidence scores are near-constant (0.856–0.937). · Accuracy drops to 64.8% when XGBoost is 99% correct. · SHAP + few-shot cuts ADS from 1.54 to 0.38. · Cross-model calibrator reduces ECE from 0.254 to 0.080.

Qwen 2.5 7B outputs near-constant verbalized confidence (0.856–0.937) whether its accuracy is 49% or 75.3%. A June 2026 arXiv paper from University of Minnesota researchers reveals the model's epistemic blind spots on structured clinical data via cross-model attribution divergence with XGBoost.

Key facts

Qwen 2.5 7B confidence range: 0.856–0.937 across 49%–75.3% accuracy.
Accuracy drops to 64.8% when XGBoost is 99% correct.
Few-shot + SHAP improves accuracy from 49% to 75.3%.
Cross-model calibrator reduces ECE from 0.254 to 0.080.
Attribution Disagreement Score drops from 1.54 to 0.38 with combined intervention.

Large language models deployed on structured clinical data cannot reliably signal when they are wrong. A new paper, LLM Doesn't Know What It Doesn't Know by Akshat Dasula, Prasanna Desikan, and Jaideep Srivastava, tests this through cross-model attribution divergence — comparing Qwen 2.5 7B's feature attributions against XGBoost's SHAP values on a clinical prediction task.

The Confidence Mirage

The LLM's verbalized confidence is epistemically vacuous. The paper reports that Qwen 2.5 7B outputs a near-constant range (0.856–0.937) regardless of whether accuracy is 49% or 75.3%. The confidence score tracks prompt format, not prediction quality [per the arXiv preprint].

Worse, the model exhibits an inverse difficulty effect: its accuracy drops to 64.8% on cases where XGBoost is 99% correct, yet it matches XGBoost (73.8% vs. 73.1%) on cases where XGBoost is only moderately certain. The LLM is most confidently wrong exactly when the task is easiest for a traditional ML model.
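One way to look for an inverse difficulty effect of this kind is to group cases by how confident the reference model is and compute the LLM's accuracy within each group. The sketch below is illustrative only: the record layout, the function name, and the bucket edges are assumptions, not the paper's protocol.

```python
from collections import defaultdict


def accuracy_by_reference_confidence(records, edges=(0.6, 0.8, 0.95)):
    """Report LLM accuracy per bucket of reference-model confidence.

    records: iterable of (reference_confidence, llm_correct) pairs.
    edges:   hypothetical bucket boundaries; bucket i holds cases whose
             reference confidence is >= exactly i of the edges.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [n_correct, n_total]
    for ref_conf, llm_correct in records:
        bucket = sum(ref_conf >= edge for edge in edges)
        totals[bucket][0] += int(llm_correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


if __name__ == "__main__":
    # Toy data, not from the paper: (XGBoost confidence, was the LLM right?)
    toy = [(0.55, True), (0.55, False), (0.70, True),
           (0.99, False), (0.99, False), (0.99, True)]
    for bucket, acc in accuracy_by_reference_confidence(toy).items():
        print(f"bucket {bucket}: LLM accuracy {acc:.2f}")
```

If LLM accuracy falls in the highest-confidence bucket rather than rising, that is the pattern the article describes.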

Super-Additive Interventions

Few-shot examples and SHAP-derived feature evidence are orthogonal interventions. Alone, few-shot examples yield 49% accuracy. SHAP injection alone performs similarly. Combined, they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% — without any training [according to the paper's Table 1].
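The article does not give the formula for the Attribution Disagreement Score, so the following is only a stand-in to show what comparing two attribution profiles can look like. It assumes each model's attributions arrive as a feature-to-weight mapping and measures the L1 distance between the normalized magnitudes, which ranges from 0 (identical profiles) to 2 (no overlap). The function names and feature names are hypothetical.

```python
def normalize(attributions):
    """Scale absolute attribution magnitudes so they sum to 1."""
    total = sum(abs(v) for v in attributions.values())
    if total == 0:
        raise ValueError("attributions are all zero")
    return {k: abs(v) / total for k, v in attributions.items()}


def attribution_disagreement(llm_attr, ref_attr):
    """L1 distance between two normalized attribution profiles.

    This is an illustrative measure, NOT the paper's ADS definition.
    Features missing from one profile are treated as weight 0.
    """
    p, q = normalize(llm_attr), normalize(ref_attr)
    features = set(p) | set(q)
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)


if __name__ == "__main__":
    # Hypothetical feature names and weights for one patient record.
    llm = {"age": 0.6, "blood_pressure": 0.1, "glucose": 0.3}
    ref = {"age": 0.1, "blood_pressure": 0.2, "glucose": 0.7}
    print(round(attribution_disagreement(llm, ref), 3))
```

A per-case score like this can then be averaged over a dataset, or passed to a downstream reliability model.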

Calibration Without Internals

The paper's fourth finding is the most practically useful. A cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080. This replaces uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference.
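Expected calibration error (ECE) is the metric quoted here. A minimal sketch of the standard equal-width binned estimate follows; the bin count and function name are choices made for illustration, and the paper's exact binning is not stated in the article.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    For each confidence bin, take |accuracy - mean confidence| and weight
    it by the fraction of samples that fall in that bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, bool(ok)))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data, not from the paper: constant confidence, half the answers right.
    print(expected_calibration_error([0.9] * 4, [True, True, False, False]))
```

When confidence barely moves, nearly all samples land in one bin, so ECE collapses to the gap between that confidence level and overall accuracy. That is why a near-constant verbalized score calibrates poorly.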

The authors frame the findings as a cold start problem: when applied to structured data, LLMs lack prior knowledge of the feature space, in contrast to natural language, which their pretraining covers. The path toward epistemic self-awareness requires cross-model attribution signals, not better verbalized confidence.

Key Takeaways

Qwen 2.5 7B's confidence is near-constant (0.856–0.937) across accuracy from 49% to 75.3%.
Combining SHAP with few-shot examples cuts ADS from 1.54 to 0.38 and lifts accuracy to 75.3%.

What to watch

Watch for follow-up work that extends the cross-model calibrator to larger models like Llama 3 70B or GPT-4o on clinical benchmarks, and whether the super-additive SHAP+few-shot effect generalizes beyond tabular data to multimodal clinical inputs like radiology reports.


AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

The paper's inverse difficulty effect is the most striking finding: Qwen 2.5 7B is most confidently wrong on easy cases. This mirrors the 'overconfidence on in-distribution' pattern seen in vision models but is more dangerous in clinical settings where straightforward cases are the most common.

The cross-model calibrator is a practical workaround, but it depends on having a reliable reference model, which becomes a circular dependency if the reference model itself is flawed.

The super-additive effect of SHAP + few-shot is reminiscent of retrieval-augmented generation but with a key difference: RAG retrieves text chunks, while SHAP provides structured feature-level evidence. This suggests that LLMs on tabular data need explicit feature attribution signals, not just textual context.

Missing from the paper: no ablation on the choice of reference model (why XGBoost over logistic regression or a neural network?), and no test on clinical data with missing or noisy features. The cold start framing is apt, but the paper proposes only detection of the cold start, not a solution for it.

#research #safety #tabular data #clinical ai



