gentic.news — AI News Intelligence Platform
A bar chart comparing clinical scores from human raters and MLLM raters, with MLLM scores clustered near the middle…
AI Research · Score: 60

MLLM Raters Show Central Tendency Bias in Clinical Scoring

Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward scale midpoint despite prompt modifications.

19h ago · 3 min read · 5 views · AI-Generated
Source: arxiv.org via arxiv_cv · Single Source
Do multimodal LLMs show central tendency bias in clinical ordinal scoring?

An arXiv preprint (arXiv:2605.16386) finds that multimodal LLMs, including GPT-5, exhibit central tendency bias when scoring the clinical Clock Drawing Test: predictions are compressed toward the scale midpoint. GPT-5 achieves an MAE of 0.67 and within-1 accuracy of 92%.

TL;DR

GPT-5 scores Clock Drawing Test with MAE 0.67 · Fine-tuned ViT achieves best calibration MAE 0.52 · Central tendency bias persists despite few-shot and prompt changes

A new arXiv study [arXiv:2605.16386] finds GPT-5 and other multimodal LLMs systematically compress clinical scores toward the middle of the scale. The central tendency bias persists even with few-shot exemplars and prompt modifications, threatening deployment in high-stakes screening workflows.

Key facts

  • Fine-tuned ViT achieves MAE 0.52, best calibration
  • GPT-5 zero-shot achieves MAE 0.67, within-1 accuracy 92%
  • All LLM families show central tendency effect
  • Bias persists with full-range few-shot examples
  • Study uses Clock Drawing Test with Shulman rubric

Researchers from multiple institutions benchmarked three frontier LLM families (including OpenAI's GPT-5) against supervised deep learning models for scoring Clock Drawing Test (CDT) images using the Shulman rubric on two public datasets. The study, posted May 11, 2026, reveals a critical flaw in LLM-as-a-judge approaches for clinical ordinal scoring.

The Calibration Gap

According to the arXiv preprint, fully fine-tuned Vision Transformers (ViTs) achieved the best calibration: a mean absolute error (MAE) of 0.52 and within-1 accuracy of 91%. Zero-shot LLMs were competitive on tolerance-based agreement (GPT-5: within-1 accuracy 92%) but showed higher absolute error (GPT-5: MAE 0.67). The key finding is that all three LLM families exhibited a pronounced central tendency effect. Predictions were systematically compressed toward the middle of the scale, with over-prediction at the low end (drawings scored 0 rated as 1) and under-prediction at the high end (drawings scored 5 rated as 4).
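To make the two metrics concrete, here is a minimal sketch of how MAE, within-1 accuracy, and a score-level calibration view could be computed. The function names and the toy data are illustrative assumptions; they are not taken from the preprint, and the numbers are invented to show the pattern rather than to reproduce any result.

```python
from collections import defaultdict

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_prediction_by_score(y_true, y_pred):
    """Average predicted score for each true score level."""
    buckets = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        buckets[t].append(p)
    return {t: sum(ps) / len(ps) for t, ps in sorted(buckets.items())}

# Made-up toy data, NOT from the study: two drawings per score on a 0-5 scale.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
# A hypothetical rater that pulls the extremes toward the middle of the scale.
y_pred = [1, 1, 2, 1, 2, 2, 3, 3, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("Within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("Mean prediction per true score:", mean_prediction_by_score(y_true, y_pred))
```

In this toy case the rater is never more than one point off, so within-1 accuracy is 1.0 and MAE is 0.5, yet the per-score means show true 0s averaging 1.0 and true 5s averaging 4.0. That is why a tolerance-based metric alone can hide compression at the extremes.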

The Persistence Problem

Targeted ablations showed that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminated the effect. This finding extends the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, where accurate scoring at the extremes most impacts screening decisions for cognitive impairment. The study highlights the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
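As a rough illustration of what post-hoc calibration can mean in this setting, the sketch below fits a linear stretch on a labeled calibration set and applies it to new raw scores. This is one simple, assumed approach (least-squares rescaling, then rounding and clipping to a 0-5 scale); the preprint does not specify this method, and the function names and data are hypothetical.

```python
def fit_linear_calibration(raw_scores, true_scores):
    """Least-squares fit of true_score ~ slope * raw_score + intercept."""
    n = len(raw_scores)
    mean_x = sum(raw_scores) / n
    mean_y = sum(true_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in raw_scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw_scores, true_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apply_calibration(raw_score, slope, intercept, lo=0, hi=5):
    """Rescale a raw score, round to the nearest integer, clip to the scale."""
    adjusted = round(slope * raw_score + intercept)
    return max(lo, min(hi, adjusted))

# Made-up calibration set, NOT from the study: raw rater scores that are
# compressed toward the middle, paired with reference scores on a 0-5 scale.
calib_raw = [1, 1, 2, 1, 2, 2, 3, 3, 3, 4, 4, 4]
calib_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

slope, intercept = fit_linear_calibration(calib_raw, calib_true)

# A slope above 1 means the mapping stretches scores away from the midpoint.
for raw in (1, 2, 3, 4):
    print(raw, "->", apply_calibration(raw, slope, intercept))
```

A mapping like this can only reassign existing raw scores to new values; it cannot recover distinctions the rater never made, so a held-out set would still be needed to check whether the adjusted scores are actually better at the extremes.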

Figure 2: Score-level calibration. Supervised models (solid) cluster near the identity diagonal; LLM judges (dashed) exh…

Why This Matters More Than the Press Release Suggests

The central tendency bias does not look like a prompt engineering problem; it appears to be structural. Central tendency bias was previously documented in NLP evaluation, but clinical scoring has real-world consequences: a score of 0 (severe impairment) versus 1 (mild impairment) can determine whether a patient receives follow-up care. The fact that neither few-shot examples nor prompt rewrites fix the bias suggests it may be inherent to how LLMs process ordinal scales, possibly stemming from token-level prediction objectives that favor central values.

Figure 1: Predicted-score distributions versus ground truth. Supervised models (left) approximate the true label distrib…

This mirrors findings from the VAB benchmark study [gentic.news, May 14, 2026], where top MLLMs judged beauty correctly only 26.5% of the time — another case where LLM-as-a-judge approaches fail on structured perceptual tasks. The clinical domain adds urgency because errors are not academic: they affect real patient outcomes.

What to watch

Watch for follow-up work on post-hoc calibration methods for LLM clinical raters, and whether OpenAI or other providers release calibration-aware versions of their multimodal models. The next arXiv submission on this topic could reveal if the bias extends to other ordinal clinical scales beyond CDT.

Figure 6: Confusion matrices under prompt ablations. Few-shot prompting increases diagonal mass most notably for GPT-5 a…



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

This study is a sobering check on the LLM-as-a-judge hype, particularly for high-stakes clinical applications. The central tendency bias is not a new finding in NLP evaluation (see prior work on LLM judges in summarization and translation), but its persistence in multimodal clinical tasks is notable. The fact that fine-tuned ViTs outperform GPT-5 on absolute error (MAE 0.52 vs 0.67) suggests that specialized supervised models remain superior for structured perceptual tasks where ordinal scales matter.

What's most striking is the structural nature of the bias: it resists prompt engineering, which is the default 'fix' for LLM behavior. This implies the bias may be baked into the training objective. LLMs learn to predict tokens, and if central values appear more frequently in training data, that would create a prior that skews scoring. The clinical domain amplifies the problem because extreme scores carry disproportionate clinical weight.

Comparing to the VAB benchmark results from May 14, a pattern emerges: MLLMs struggle with structured perceptual evaluation tasks that require precise ordinal judgments. This suggests that LLM-as-a-judge approaches may be fundamentally limited for tasks where the evaluation itself requires fine-grained discrimination, as opposed to open-ended quality assessment.