GPT-4.1 scored 24.65% top-3 diagnostic accuracy on 5,811 real hospital dermatology cases, down from 42.25% on public benchmarks. The multi-site study tested five multimodal LLMs and found that benchmark performance substantially overestimates clinical capability.
Key facts
- GPT-4.1 top-3 accuracy: 42.25% on benchmarks, 24.65% on real cases.
- Open-weight models: 1.50%-13.35% real-world top-3 accuracy.
- Clinical context boosted GPT-4.1 to 38.93%, but outputs were brittle to context errors.
- 5,811 cases, 46,405 images from multi-site hospital cohort.
- Triage sensitivity above 60%, but diagnostic reliability insufficient for clinical deployment.
A new study from researchers including Roy Jiang, Hyunjae Kim, and Zhenyue Qin quantifies the gap between multimodal LLM benchmark performance and real-world clinical dermatology. The team evaluated four open-weight models — InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct — and one commercial model, GPT-4.1, across three public datasets and a retrospective multi-site hospital cohort of 5,811 cases with 46,405 clinical images [per the arXiv preprint].
Key Takeaways
- Multimodal LLMs lose 10-20 percentage points of top-3 accuracy moving from public benchmarks to real hospital cases.
- Clinical context narrows the gap but leaves outputs brittle to documentation errors.
- Triage shows promise; standalone diagnosis does not.
The Benchmark-to-Bedside Gap

On public benchmarks, the best open-weight model achieved 26.55% top-3 diagnostic accuracy, while GPT-4.1 reached 42.25%. On real-world consultation cases using images alone, open-weight models fell to 1.50%-13.35%, and GPT-4.1 dropped to 24.65%. Incorporating clinical context improved performance — open-weight models rose to 28.75% and GPT-4.1 to 38.93% — but model outputs were highly sensitive to incomplete or erroneous consultation context [according to the study].
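For readers unfamiliar with the metric: top-3 accuracy counts a case as correct when the true diagnosis appears anywhere in the model's three highest-ranked differentials. A minimal Python sketch, with diagnoses that are illustrative rather than taken from the study:

```python
# Top-3 diagnostic accuracy: a case counts as correct if the true
# diagnosis appears among the model's three highest-ranked guesses.
def top3_accuracy(predictions: list[list[str]], labels: list[str]) -> float:
    hits = sum(label in preds[:3] for preds, label in zip(predictions, labels))
    return hits / len(labels)

# Illustrative example (not study data): 2 of 3 cases hit the top 3.
preds = [
    ["psoriasis", "eczema", "tinea corporis"],
    ["melanoma", "seborrheic keratosis", "basal cell carcinoma"],
    ["acne", "rosacea", "folliculitis"],
]
labels = ["eczema", "melanoma", "lichen planus"]
print(top3_accuracy(preds, labels))  # 0.667
```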
Triage Potential, Not Diagnostic Reliability
For severity-based triage, all models achieved moderate sensitivity above 60%, suggesting potential utility for screening. For diagnosis, however, the authors conclude the models show "insufficient reliability for clinical deployment." The unique take: this is the largest real-world dermatology evaluation to date, built on a multi-site hospital cohort rather than curated benchmark images, and it shows that public benchmarks systematically overestimate capability by 10-20 percentage points.
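Sensitivity here means recall on the severe class: of all truly severe cases, the fraction the model flags for escalation. A hedged sketch assuming a binary severe/non-severe triage label, which may not match the study's exact severity scheme:

```python
# Triage sensitivity (recall on the severe class): of all truly severe
# cases, what fraction did the model flag for escalation?
def triage_sensitivity(pred_severe: list[bool], true_severe: list[bool]) -> float:
    true_positives = sum(p and t for p, t in zip(pred_severe, true_severe))
    return true_positives / sum(true_severe)

# Illustrative example (not study data): 2 of 3 severe cases flagged.
print(triage_sensitivity(
    pred_severe=[True, False, True, True],
    true_severe=[True, True, True, False],
))  # 0.667, i.e. sensitivity just above 60%
```

Note that sensitivity alone says nothing about false alarms; a screening tool would also need acceptable specificity before deployment.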
Brittle to Real-World Noise

The sensitivity to erroneous clinical context is particularly concerning. The study found that when provided with incomplete or incorrect consultation notes, model accuracy dropped sharply — in some cases below the image-only baseline. This brittleness means deployment in clinical settings, where documentation is often messy, would require robust guardrails not yet demonstrated.
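One way to probe that brittleness is a context-ablation harness that scores the same cases image-only, with full notes, and with deliberately corrupted notes. The sketch below is hypothetical: `Case`, `query_model`, and the sentence-dropping corruption are assumptions for illustration, not the study's protocol.

```python
# Hypothetical context-ablation harness (not the study's code).
import random
from dataclasses import dataclass

@dataclass
class Case:
    image: bytes   # raw image payload passed to the model
    notes: str     # consultation notes accompanying the image
    label: str     # ground-truth diagnosis

def corrupt_notes(notes: str, drop_prob: float = 0.3) -> str:
    """Crudely simulate messy documentation by dropping sentences."""
    kept = [s for s in notes.split(". ") if random.random() > drop_prob]
    return ". ".join(kept)

def ablation_accuracies(cases, query_model) -> dict[str, float]:
    """Top-3 accuracy under three context conditions.
    `query_model(image, notes)` must return a ranked list of diagnoses."""
    conditions = {
        "image_only": lambda c: None,
        "full_context": lambda c: c.notes,
        "corrupted_context": lambda c: corrupt_notes(c.notes),
    }
    return {
        name: sum(c.label in query_model(c.image, get_notes(c))[:3]
                  for c in cases) / len(cases)
        for name, get_notes in conditions.items()
    }

# Dry run with a stub model that always answers the same differentials.
stub = lambda image, notes: ["eczema", "psoriasis", "tinea corporis"]
cases = [Case(image=b"", notes="Itchy rash. Worse at night.", label="eczema")]
print(ablation_accuracies(cases, stub))  # all conditions: 1.0
```

A real harness would compare the corrupted-context score against the image-only baseline; per the study, that comparison is where performance collapsed.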
What to watch
Watch for follow-up studies testing GPT-5.3-Codex-Spark and o-series reasoning models on the same cohort. If accuracy does not cross 50% with clinical context, the benchmark-to-bedside gap may persist across model generations, pushing deployment timelines for AI dermatology to 2028 or later.