[Figure: Bar chart comparing GPT-4.1's diagnostic accuracy on real dermatology cases (24.65%) versus public benchmarks.]

GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks

Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.

6h ago · 3 min read · AI-Generated
Source: arxiv.org via arxiv_cv · Corroborated
What is the real-world diagnostic accuracy of multimodal LLMs in dermatology?

A 5,811-case hospital study found GPT-4.1 achieved 24.65% top-3 diagnostic accuracy on real dermatology cases, down from 42.25% on public benchmarks. Open-weight models scored 1.50%-13.35%. Clinical context improved scores but models remained unreliable for deployment.

TL;DR

GPT-4.1 top-3 accuracy drops from 42.25% to 24.65% on real cases. · Open-weight models fall to 1.50%-13.35% on hospital data. · Clinical context helps but models are brittle to errors.

GPT-4.1 scored 24.65% top-3 diagnostic accuracy on 5,811 real hospital dermatology cases, down from 42.25% on public benchmarks. The multi-site study tested five multimodal LLMs and found that benchmark performance substantially overestimates clinical capability.
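
Top-3 accuracy counts a case as correct when the ground-truth diagnosis appears among the model's three highest-ranked differentials. A minimal sketch of the metric (the function name and toy data are illustrative, not from the paper):

```python
def top3_accuracy(ranked_predictions, labels):
    """Fraction of cases whose true diagnosis appears in the
    model's top-3 ranked differential diagnoses."""
    hits = sum(label in preds[:3]
               for preds, label in zip(ranked_predictions, labels))
    return hits / len(labels)

# Toy example: the correct diagnosis is in the top 3 for 2 of 3 cases.
preds = [["eczema", "psoriasis", "tinea corporis"],
         ["melanocytic nevus", "melanoma", "seborrheic keratosis"],
         ["acne", "rosacea", "folliculitis"]]
labels = ["psoriasis", "basal cell carcinoma", "acne"]
print(top3_accuracy(preds, labels))  # 0.666...
```

Under this metric, GPT-4.1's 24.65% means the correct diagnosis appeared in its top three for roughly one in four real cases.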

Key facts

  • GPT-4.1 top-3 accuracy: 42.25% on benchmarks, 24.65% on real cases.
  • Open-weight models: 1.50%-13.35% real-world top-3 accuracy.
  • Clinical context boosted GPT-4.1 to 38.93% but left models brittle to errors.
  • 5,811 cases, 46,405 images from multi-site hospital cohort.
  • Triage sensitivity above 60% but insufficient for clinical deployment.

A new study from researchers including Roy Jiang, Hyunjae Kim, and Zhenyue Qin quantifies the gap between multimodal LLM benchmark performance and real-world clinical dermatology. The team evaluated four open-weight models — InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct — and one commercial model, GPT-4.1, across three public datasets and a retrospective multi-site hospital cohort of 5,811 cases with 46,405 clinical images [per the arXiv preprint].

The Benchmark-to-Bedside Gap

On public benchmarks, the best open-weight model achieved 26.55% top-3 diagnostic accuracy, while GPT-4.1 reached 42.25%. On real-world consultation cases using images alone, open-weight models fell to 1.50%-13.35%, and GPT-4.1 dropped to 24.65%. Incorporating clinical context improved performance — open-weight models rose to 28.75% and GPT-4.1 to 38.93% — but model outputs were highly sensitive to incomplete or erroneous consultation context [according to the study].
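
The percentage-point drops behind the headline claim fall directly out of the reported figures. A quick check using the article's numbers (the condition labels are mine, and pairing the open-weight figures into one row assumes they all refer to the best open-weight model):

```python
# Reported top-3 accuracies (%), per the study as summarized above.
results = {
    "GPT-4.1":                {"benchmark": 42.25, "image_only": 24.65, "with_context": 38.93},
    "best open-weight model": {"benchmark": 26.55, "image_only": 13.35, "with_context": 28.75},
}

for model, r in results.items():
    drop = r["benchmark"] - r["image_only"]
    recovered = r["with_context"] - r["image_only"]
    print(f"{model}: -{drop:.2f} pts benchmark to image-only, "
          f"+{recovered:.2f} pts recovered with clinical context")
# GPT-4.1: -17.60 pts, +14.28 pts recovered
# best open-weight model: -13.20 pts, +15.40 pts recovered
```

Both drops land inside the 10-20 percentage-point range the article's summary cites.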

Triage Potential, Not Diagnostic Reliability

For severity-based triage, all models achieved moderate sensitivity above 60%, suggesting potential utility for screening. However, the authors conclude the models show "insufficient reliability for clinical deployment" for diagnosis. The unique take: this study provides the largest real-world dermatology evaluation to date, using a multi-site hospital cohort rather than curated benchmark images, revealing that public benchmarks systematically overestimate capability by 10-20 percentage points.
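
Triage sensitivity here is recall on severe cases: the share of truly severe presentations the model flags for escalation. A minimal sketch (the names and toy numbers are illustrative, not the study's data):

```python
def triage_sensitivity(flagged_ids, severe_ids):
    """Recall on severe cases: flagged true positives / all truly severe."""
    true_positives = len(severe_ids & flagged_ids)
    return true_positives / len(severe_ids)

# Toy example: the model flags 7 of 10 truly severe cases -> 0.70,
# above the ~60% the study reports but short of deployment-grade.
severe_ids = set(range(10))
flagged_ids = {0, 1, 2, 3, 4, 5, 6, 42}   # case 42 is a false positive
print(triage_sensitivity(flagged_ids, severe_ids))  # 0.7
```

Note that sensitivity alone says nothing about false alarms; a screening deployment would also need acceptable specificity.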

Brittle to Real-World Noise

The sensitivity to erroneous clinical context is particularly concerning. The study found that when provided with incomplete or incorrect consultation notes, model accuracy dropped sharply — in some cases below the image-only baseline. This brittleness means deployment in clinical settings, where documentation is often messy, would require robust guardrails not yet demonstrated.
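
One way to quantify this brittleness is an ablation that scores the same case under clean, incomplete, and deliberately wrong consultation notes. A hypothetical harness sketch; the `diagnose` callable and the specific perturbations are my assumptions, not the study's protocol:

```python
def context_ablation(diagnose, image, notes, true_label):
    """Score one case under several context conditions; `diagnose` is
    assumed to return a ranked list of differential diagnoses."""
    conditions = {
        "image_only":    None,
        "full_context":  notes,
        "truncated":     notes[: len(notes) // 2],    # incomplete documentation
        "wrong_history": "No prior skin conditions.", # erroneous note
    }
    # Record a top-3 hit or miss per condition.
    return {name: true_label in diagnose(image, ctx)[:3]
            for name, ctx in conditions.items()}
```

Aggregating hit rates across a cohort would show whether corrupted context pushes accuracy below the image-only baseline, as the study reports.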

What to watch

Watch for follow-up studies testing GPT-5.3-Codex-Spark and o-series reasoning models on the same cohort. If accuracy does not cross 50% with clinical context, the benchmark-to-bedside gap may persist across model generations, pushing deployment timelines for AI dermatology to 2028 or later.



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala Smith.


AI Analysis

This study is the strongest evidence yet that dermatology MLLM benchmarks are saturated and misleading. The 5,811-case hospital cohort is an order of magnitude larger than typical public datasets, and the multi-site design reduces single-institution bias. The 10-20 percentage point drop between benchmarks and real-world performance mirrors patterns seen in radiology AI, where FDA-cleared models often degrade 15-30% in deployment. The brittleness to erroneous clinical context is the most worrying finding — it suggests these models are learning spurious correlations between image features and dataset-specific text patterns rather than robust diagnostic reasoning. The triage sensitivity above 60% is a silver lining but insufficient for autonomous use; screening with human oversight may be the only near-term path.