GPT-4.1 scored 24.65% top-3 diagnostic accuracy on 5,811 real hospital dermatology cases, down from 42.25% on public benchmarks. The multi-site study tested five multimodal LLMs and found that benchmark performance substantially overestimates clinical capability.
Key facts
- GPT-4.1 top-3 accuracy: 42.25% on benchmarks, 24.65% on real cases.
- Open-weight models: 1.50%-13.35% real-world top-3 accuracy.
- Clinical context boosted GPT-4.1 to 38.93%, but outputs were brittle to context errors.
- 5,811 cases, 46,405 images from multi-site hospital cohort.
- Triage sensitivity above 60%, but diagnostic reliability insufficient for clinical deployment.
A new study from researchers including Roy Jiang, Hyunjae Kim, and Zhenyue Qin quantifies the gap between multimodal LLM benchmark performance and real-world clinical dermatology. The team evaluated four open-weight models — InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct — and one commercial model, GPT-4.1, across three public datasets and a retrospective multi-site hospital cohort of 5,811 cases with 46,405 clinical images [per the arXiv preprint].
Key Takeaways
- Multimodal LLMs lose 10-20 percentage points of top-3 accuracy moving from public benchmarks to real hospital cases.
- Clinical context narrows the gap but leaves outputs brittle to documentation errors.
- Triage shows promise; standalone diagnosis does not.
The Benchmark-to-Bedside Gap

On public benchmarks, the best open-weight model achieved 26.55% top-3 diagnostic accuracy, while GPT-4.1 reached 42.25%. On real-world consultation cases using images alone, open-weight models fell to 1.50%-13.35%, and GPT-4.1 dropped to 24.65%. Incorporating clinical context improved performance — open-weight models rose to 28.75% and GPT-4.1 to 38.93% — but model outputs were highly sensitive to incomplete or erroneous consultation context [according to the study].
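For readers unfamiliar with the metric: top-3 accuracy counts a case as correct when the true diagnosis appears anywhere in the model's three highest-ranked differentials. A minimal Python sketch, with diagnoses that are illustrative rather than taken from the study:

```python
# Top-3 diagnostic accuracy: a case counts as correct if the true
# diagnosis appears among the model's three highest-ranked guesses.
def top3_accuracy(predictions: list[list[str]], labels: list[str]) -> float:
    hits = sum(label in preds[:3] for preds, label in zip(predictions, labels))
    return hits / len(labels)

# Illustrative example (not study data): 2 of 3 cases hit the top 3.
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "seborrheic keratosis", "basal cell carcinoma"],
    ["acne", "rosacea", "folliculitis"],
]
labels = ["eczema", "melanoma", "lichen planus"]
print(top3_accuracy(preds, labels))  # 0.667
```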
Triage Potential, Not Diagnostic Reliability
For severity-based triage, all models achieved moderate sensitivity above 60%, suggesting potential utility for screening. For diagnosis, however, the authors conclude the models show "insufficient reliability for clinical deployment." The unique take: this is the largest real-world dermatology evaluation to date, built on a multi-site hospital cohort rather than curated benchmark images, and it shows that public benchmarks systematically overestimate capability by 10-20 percentage points.
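Sensitivity here means recall on the severe class: of all truly severe cases, the fraction the model flags for escalation. A hedged sketch assuming a binary severe/non-severe triage label, which may not match the study's exact severity scheme:

```python
# Triage sensitivity (recall on the severe class): of all truly severe
# cases, what fraction did the model flag for escalation?
def triage_sensitivity(pred_severe: list[bool], true_severe: list[bool]) -> float:
    true_positives = sum(p and t for p, t in zip(pred_severe, true_severe))
    return true_positives / sum(true_severe)

# Illustrative example (not study data): 2 of 3 severe cases flagged.
print(triage_sensitivity(
    pred_severe=[True, False, True, True],
    true_severe=[True, True, True, False],
))  # 0.667, i.e. sensitivity just above 60%
```

Note that sensitivity alone says nothing about false alarms; a screening tool would also need acceptable specificity before deployment.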
Brittle to Real-World Noise

The sensitivity to erroneous clinical context is particularly concerning. The study found that when provided with incomplete or incorrect consultation notes, model accuracy dropped sharply — in some cases below the image-only baseline. This brittleness means deployment in clinical settings, where documentation is often messy, would require robust guardrails not yet demonstrated.
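One way to probe that brittleness is a context-ablation harness that scores the same cases image-only, with full notes, and with deliberately corrupted notes. The sketch below is hypothetical: `Case`, `query_model`, and the sentence-dropping corruption are assumptions for illustration, not the study's protocol.

```python
# Hypothetical context-ablation harness (not the study's code).
import random
from dataclasses import dataclass

@dataclass
class Case:
    image: bytes   # raw image payload passed to the model
    notes: str     # consultation notes accompanying the image
    label: str     # ground-truth diagnosis

def corrupt_notes(notes: str, drop_prob: float = 0.3) -> str:
    """Crudely simulate messy documentation by dropping sentences."""
    kept = [s for s in notes.split(". ") if random.random() > drop_prob]
    return ". ".join(kept)

def ablation_accuracies(cases, query_model) -> dict[str, float]:
    """Top-3 accuracy under three context conditions.
    `query_model(image, notes)` must return a ranked list of diagnoses."""
    conditions = {
        "image_only": lambda c: None,
        "full_context": lambda c: c.notes,
        "corrupted_context": lambda c: corrupt_notes(c.notes),
    }
    return {
        name: sum(c.label in query_model(c.image, get_notes(c))[:3]
                  for c in cases) / len(cases)
        for name, get_notes in conditions.items()
    }

# Dry run with a stub model that always answers the same differentials.
stub = lambda image, notes: ["eczema", "psoriasis", "tinea corporis"]
cases = [Case(image=b"", notes="Itchy rash. Worse at night.", label="eczema")]
print(ablation_accuracies(cases, stub))  # all conditions: 1.0
```

A real harness would compare the corrupted-context score against the image-only baseline; per the study, that comparison is where performance collapsed.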
What to watch
Watch for follow-up studies testing GPT-5.3-Codex-Spark and o-series reasoning models on the same cohort. If accuracy does not cross 50% with clinical context, the benchmark-to-bedside gap may persist across model generations, pushing deployment timelines for AI dermatology to 2028 or later.