A new arXiv study [arXiv:2605.16386] finds GPT-5 and other multimodal LLMs systematically compress clinical scores toward the middle of the scale. The central tendency bias persists even with few-shot exemplars and prompt modifications, threatening deployment in high-stakes screening workflows.
Key facts
- Fine-tuned ViT achieves MAE 0.52, best calibration
- GPT-5 zero-shot achieves MAE 0.67, within-1 accuracy 92%
- All LLM families show central tendency effect
- Bias persists with full-range few-shot examples
- Study uses Clock Drawing Test with Shulman rubric
Researchers from multiple institutions benchmarked three frontier LLM families (including OpenAI's GPT-5) against supervised deep learning models for scoring Clock Drawing Test (CDT) images using the Shulman rubric on two public datasets. The study, posted May 11, 2026, reveals a critical flaw in LLM-as-a-judge approaches for clinical ordinal scoring.
The Calibration Gap
Fully fine-tuned Vision Transformers (ViTs) achieved the best calibration, with an MAE of 0.52 and a within-1 accuracy of 91%. Zero-shot LLMs were competitive on tolerance-based agreement but showed higher absolute error: GPT-5 reached a within-1 accuracy of 92% with an MAE of 0.67. The key finding, according to the arXiv preprint, is that all three LLM families exhibited a pronounced central tendency effect. Predictions were systematically compressed toward the middle of the scale, with over-prediction at the low end (true score 0 rated as 1) and under-prediction at the high end (true score 5 rated as 4).
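To make the gap between the two metrics concrete, here is a minimal Python sketch of MAE, within-1 accuracy, and a per-score mean signed error. The function names and the toy labels are illustrative assumptions, not data or code from the preprint. The toy rater is right everywhere except the extremes, which is enough to show how within-1 accuracy can look perfect while every score of 0 and 5 is missed.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_signed_error_by_score(y_true, y_pred):
    """Average (predicted - true) for each true score.

    Positive values at the low end together with negative values at the
    high end are the signature of central tendency compression.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Hypothetical toy data on a 0-5 scale (NOT from the study):
# the rater never outputs 0 or 5.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("signed error by true score:", mean_signed_error_by_score(y_true, y_pred))
```

On this toy data the aggregate numbers look respectable, yet the per-score breakdown shows a +1 bias at score 0 and a -1 bias at score 5. That is why a per-score view is a useful companion to a single headline metric.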
The Persistence Problem
Targeted ablations showed that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminated the effect. This finding extends the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, where accurate scoring at the extremes most impacts screening decisions for cognitive impairment. The study highlights the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
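The summary above does not say which post-hoc calibration method the authors have in mind, so the following is only one simple possibility, sketched under stated assumptions: a lookup table, fitted on a held-out labeled set, that remaps each raw LLM score to the typical true score observed for it. The function names and the calibration data are hypothetical.

```python
from collections import defaultdict
from statistics import median_low


def fit_recalibration_table(cal_true, cal_pred):
    """Map each raw predicted score to the low median of the true scores
    observed for it on a held-out calibration set.

    median_low keeps the output on the integer scale.
    """
    buckets = defaultdict(list)
    for t, p in zip(cal_true, cal_pred):
        buckets[p].append(t)
    return {p: median_low(ts) for p, ts in sorted(buckets.items())}


def recalibrate(preds, table):
    """Apply the table; scores never seen during calibration pass through."""
    return [table.get(p, p) for p in preds]


# Hypothetical calibration set (NOT from the study): the rater compresses
# true 0s up to 1 and true 5s down to 4.
cal_true = [0, 0, 0, 1, 2, 3, 4, 5, 5, 5]
cal_pred = [1, 1, 1, 1, 2, 3, 4, 4, 4, 4]

table = fit_recalibration_table(cal_true, cal_pred)
print(table)
print(recalibrate([1, 4, 2], table))
```

A remap like this depends heavily on the score distribution in the calibration set, and it cannot restore information the rater has already discarded: once true scores of 0 and 1 both come out as 1, no deterministic mapping of the output can tell them apart. It is also not guaranteed to be monotonic, which a method such as isotonic regression would enforce.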

Why This Matters More Than the Press Release Suggests
The central tendency bias does not look like a prompt engineering problem; it appears to be structural. Central tendency bias was previously documented in NLP evaluation, but clinical scoring has real-world consequences: a one-point error at the low end of the scale, such as a true score of 0 rated as 1, can determine whether a patient receives follow-up care. The fact that neither few-shot examples nor prompt rewrites fix the bias suggests it may be inherent to how LLMs process ordinal scales, possibly stemming from token-level prediction objectives that favor central values.

This mirrors findings from the VAB benchmark study [gentic.news, May 14, 2026], where top MLLMs judged beauty correctly only 26.5% of the time — another case where LLM-as-a-judge approaches fail on structured perceptual tasks. The clinical domain adds urgency because errors are not academic: they affect real patient outcomes.
What to watch
Watch for follow-up work on post-hoc calibration methods for LLM clinical raters, and for whether OpenAI or other providers release calibration-aware versions of their multimodal models. Later studies on this topic could reveal whether the bias extends to ordinal clinical scales other than the CDT.
