What is the central tendency effect in LLM scoring?

It's when LLMs systematically compress predictions toward the middle of a scale, over-predicting low scores and under-predicting high scores.

Can this bias be fixed with better prompts?

No. The study found that neither few-shot examples nor removing clinical terms from the prompt eliminated the effect.




MLLM Raters Show Central Tendency Bias in Clinical Scoring

Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward scale midpoint despite prompt modifications.

Ala SMITH & AI Research Desk · May 19, 2026 · 3 min read · AI-Generated

Source: arxiv.org via arxiv_cv

Do multimodal LLMs show central tendency bias in clinical ordinal scoring?

Yes. The study (arXiv:2605.16386) finds that multimodal LLMs, including GPT-5, exhibit central tendency bias in clinical Clock Drawing Test scoring, compressing predictions toward the scale midpoint. GPT-5 achieves MAE 0.67 and within-1 accuracy of 92%.

TL;DR

GPT-5 scores Clock Drawing Test with MAE 0.67 · Fine-tuned ViT achieves best calibration MAE 0.52 · Central tendency bias persists despite few-shot and prompt changes

A new arXiv study [arXiv:2605.16386] finds GPT-5 and other multimodal LLMs systematically compress clinical scores toward the middle of the scale. The central tendency bias persists even with few-shot exemplars and prompt modifications, threatening deployment in high-stakes screening workflows.

Key facts

Fine-tuned ViT achieves MAE 0.52, best calibration
GPT-5 zero-shot achieves MAE 0.67, within-1 accuracy 92%
All LLM families show central tendency effect
Bias persists with full-range few-shot examples
Study uses Clock Drawing Test with Shulman rubric

Researchers from multiple institutions benchmarked three frontier LLM families (including OpenAI's GPT-5) against supervised deep learning models for scoring Clock Drawing Test (CDT) images using the Shulman rubric on two public datasets. The study, posted May 11, 2026, reveals a critical flaw in LLM-as-a-judge approaches for clinical ordinal scoring.

The Calibration Gap

Fully fine-tuned Vision Transformers (ViTs) achieved the best calibration: MAE 0.52 and within-1 accuracy of 91%. Zero-shot LLMs were competitive on tolerance-based agreement (GPT-5: MAE 0.67, within-1 accuracy 92%) but showed higher absolute error. The key finding: all three LLM families exhibited a pronounced central tendency effect, with predictions systematically compressed toward the middle of the scale. They over-predicted at the low end (true score 0 rated as 1) and under-predicted at the high end (true score 5 rated as 4). [According to the arXiv preprint]
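To make the metrics concrete, here is a minimal Python sketch of how MAE, within-1 accuracy, and a per-score signed error could be computed for an ordinal rater. The function names and the toy labels are illustrative assumptions, not the study's code or data. The toy example is built so that within-1 accuracy looks perfect while the signed error still reveals compression at both ends of the scale.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_signed_error_by_score(y_true, y_pred):
    """Mean of (predicted - true) for each true score level.

    Positive values at the low end and negative values at the high end
    are the signature of a central tendency effect.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Made-up labels on a 0-5 scale, for illustration only (not study data).
y_true = [0, 0, 1, 2, 3, 3, 4, 5, 5, 5]
y_pred = [1, 1, 1, 2, 3, 3, 4, 4, 4, 5]

print("MAE:", mae(y_true, y_pred))
print("Within-1 accuracy:", within_one_accuracy(y_true, y_pred))
print("Signed error by true score:", mean_signed_error_by_score(y_true, y_pred))
```

In this toy case every prediction is within one point of the truth, yet the signed error is positive for true score 0 and negative for true score 5. That is why a tolerance-based metric alone can hide the bias the study describes.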

The Persistence Problem

Targeted ablations showed that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminated the effect. This finding extends the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, where accurate scoring at the extremes most impacts screening decisions for cognitive impairment. The study highlights the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
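The preprint's summary does not say which post-hoc calibration method it has in mind. One simple option, shown here purely as an assumption, is to fit a least-squares line that stretches compressed rater outputs back across the scale, using a held-out set with human labels. The sketch also assumes the rater yields a continuous raw score (for example, the mean of several sampled ratings), since a monotone remapping cannot separate cases that received the same hard integer score. All names and numbers are hypothetical.

```python
def fit_linear_calibrator(raw_preds, true_scores):
    """Least-squares fit of true_score ~ slope * raw_pred + intercept."""
    n = len(raw_preds)
    mean_x = sum(raw_preds) / n
    mean_y = sum(true_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in raw_preds)
    if var_x == 0:
        raise ValueError("raw predictions are constant; cannot fit a line")
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(raw_preds, true_scores)
    )
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_calibrator(raw_pred, slope, intercept, lo=0, hi=5):
    """Stretch a raw score, round to the nearest level, clip to the scale."""
    stretched = slope * raw_pred + intercept
    return min(hi, max(lo, round(stretched)))


# Made-up calibration set: raw scores squeezed into the range 1.0 to 4.0.
calib_true = [0, 1, 2, 3, 4, 5]
calib_raw = [1.0, 1.6, 2.2, 2.8, 3.4, 4.0]

slope, intercept = fit_linear_calibrator(calib_raw, calib_true)

# A slope above 1 means the raw scores were compressed toward the midpoint.
print("slope:", slope, "intercept:", intercept)
print("raw 1.2 ->", apply_calibrator(1.2, slope, intercept))
print("raw 3.8 ->", apply_calibrator(3.8, slope, intercept))
```

A linear stretch is the crudest choice. Monotone methods such as isotonic regression are a common alternative, and any calibrator should be checked on data that was not used to fit it.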

Figure 2: Score-level calibration. Supervised models (solid) cluster near the identity diagonal; LLM judges (dashed) exh

Why This Matters More Than the Press Release Suggests

The central tendency bias does not look like a prompt engineering problem; it appears to be structural. Central tendency bias was previously documented in NLP evaluation, but clinical scoring has real-world consequences: a score of 0 (severe impairment) versus 1 (mild impairment) can determine whether a patient receives follow-up care. That neither few-shot examples nor prompt rewrites fix the bias suggests it may be inherent to how LLMs process ordinal scales, possibly stemming from token-level prediction objectives that favor central values.

Figure 1: Predicted-score distributions versus ground truth. Supervised models (left) approximate the true label distrib…

This mirrors findings from the VAB benchmark study [gentic.news, May 14, 2026], where top MLLMs judged beauty correctly only 26.5% of the time — another case where LLM-as-a-judge approaches fail on structured perceptual tasks. The clinical domain adds urgency because errors are not academic: they affect real patient outcomes.

What to watch

Watch for follow-up work on post-hoc calibration methods for LLM clinical raters, and whether OpenAI or other providers release calibration-aware versions of their multimodal models. The next arXiv submission on this topic could reveal if the bias extends to other ordinal clinical scales beyond CDT.

Figure 6: Confusion matrices under prompt ablations. Few-shot prompting increases diagonal mass most notably for GPT-5 a…

Source: gentic.news · May 19, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

This study is a sobering check on the LLM-as-a-judge hype, particularly for high-stakes clinical applications. The central tendency bias is not a new finding in NLP evaluation (see prior work on LLM judges in summarization and translation), but its persistence in multimodal clinical tasks is notable. The fact that fine-tuned ViTs outperform (MAE 0.52 vs GPT-5's 0.67) suggests that specialized supervised models remain superior for structured perceptual tasks where ordinal scales matter.

What's most striking is the structural nature of the bias: it resists prompt engineering, which is the default 'fix' for LLM behavior. This implies the bias may be baked into the training objective: LLMs learn to predict tokens, and central values appear more frequently in training data, creating a prior that skews scoring. The clinical domain amplifies the problem because extreme scores carry disproportionate clinical weight.

Comparing to the VAB benchmark results from May 14, a pattern emerges: MLLMs struggle with structured perceptual evaluation tasks that require precise ordinal judgments. This suggests that LLM-as-a-judge approaches may be fundamentally limited for tasks where the evaluation itself requires fine-grained discrimination, as opposed to open-ended quality assessment.

