ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%

Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.

gentic.news Editorial·1d ago·7 min read·via arxiv_cv


Researchers have introduced ReXInTheWild, a new benchmark designed to evaluate how well vision-language models (VLMs) can interpret everyday medical photographs—the kind increasingly used in telemedicine and online health consultations. Published on arXiv, the benchmark addresses a critical gap: while general-purpose VLMs excel at recognizing objects in natural images, and specialized medical models are trained on radiology scans or pathology slides, no comprehensive test existed for analyzing casual clinical photos that require both fine-grained visual understanding and domain-specific medical reasoning.

The benchmark consists of 955 multiple-choice questions, each verified by clinicians, spanning seven clinical topics across 484 photographs sourced directly from biomedical literature. These aren't curated medical images but real-world photos showing conditions like skin lesions, wound infections, eye abnormalities, and oral health issues—exactly what patients might send to their doctors via telehealth platforms.

What the Researchers Built

ReXInTheWild (Real-world eXamples In The Wild) is specifically designed to test multimodal models at the intersection of natural image understanding and clinical reasoning. The researchers collected photographs from biomedical publications that represent actual clinical scenarios, then developed questions that require:

  1. Visual recognition: Identifying relevant anatomical structures, abnormalities, or clinical signs
  2. Medical knowledge: Understanding what those visual findings mean diagnostically
  3. Clinical reasoning: Making appropriate inferences about severity, next steps, or differential diagnoses

The 955 questions are distributed across seven clinical domains: dermatology (28%), ophthalmology (20%), dentistry/oral health (17%), wound care (13%), gynecology (10%), urology (7%), and gastroenterology (5%). Each question follows a multiple-choice format with four options, and all questions and answers were validated by board-certified physicians to ensure clinical accuracy.

Key Results

The benchmark reveals substantial performance variation among leading multimodal models:

Figure 3: Large general-purpose models, especially Gemini-3, outperformed MedGemma, a smaller medical MLLM.

  • Gemini-3: 78% (best-performing general-purpose VLM)
  • Claude Opus 4.5: 72% (strong performance, six percentage points behind the leader)
  • GPT-5: 68% (third among general-purpose models)
  • MedGemma: 37% (the medical specialist model performed worst)

Perhaps most surprisingly, MedGemma—a model specifically fine-tuned on medical data—performed significantly worse than all general-purpose models, achieving only 37% accuracy. This suggests that medical specialization alone doesn't guarantee competence with real-world clinical photographs if the training data doesn't include enough examples of everyday medical photography.

The researchers conducted a systematic error analysis that revealed four distinct failure categories:

  1. Geometric/visual errors (28%): Basic mistakes in recognizing objects, shapes, or spatial relationships in the image
  2. Medical knowledge errors (35%): Incorrect application of medical facts or concepts despite correct visual understanding
  3. Reasoning errors (22%): Flawed logical inference even with correct visual and factual information
  4. Question comprehension errors (15%): Misunderstanding what the question is asking
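As a rough sketch of how such a breakdown is produced (the per-question label format here is a hypothetical illustration, not the paper's released annotations), the four categories can be tallied from labeled failure cases:

```python
from collections import Counter

# Hypothetical per-error category labels; shares here mirror the reported percentages.
error_labels = (
    ["geometric"] * 28 + ["knowledge"] * 35 +
    ["reasoning"] * 22 + ["comprehension"] * 15
)

def error_breakdown(labels):
    """Return each error category's share of all errors, as a rounded percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}

print(error_breakdown(error_labels))
# {'geometric': 28, 'knowledge': 35, 'reasoning': 22, 'comprehension': 15}
```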

How It Works

ReXInTheWild's construction followed a rigorous methodology to ensure clinical relevance and benchmarking utility:

Example question-answer pairs from ReXInTheWild (figure panel b).

Data Collection: The team sourced 484 photographs from peer-reviewed biomedical literature, focusing on images that represent real clinical scenarios rather than idealized textbook examples. These images were selected to cover a range of lighting conditions, angles, and photographic qualities that mirror what patients actually capture.

Question Generation: For each image, clinicians generated questions that test different levels of understanding:

  • Level 1: Basic visual recognition ("What body part is shown?")
  • Level 2: Pattern recognition ("What type of skin lesion is this?")
  • Level 3: Diagnostic reasoning ("What is the most likely diagnosis?")
  • Level 4: Management reasoning ("What would be the appropriate next step?")

Validation Process: All questions underwent multiple rounds of clinician review, with disagreements resolved through consensus discussion. The final benchmark includes only questions where clinicians unanimously agreed on the correct answer.

Evaluation Protocol: Models are presented with the image and question, then must select from four answer choices. The benchmark uses exact match accuracy as the primary metric, with additional analysis of error patterns across question types and clinical domains.
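A minimal sketch of this protocol (the question schema and the `model` callable are illustrative assumptions, not the authors' released evaluation harness):

```python
# Exact-match evaluation loop for a four-option multiple-choice benchmark.
# Each question record pairs an image with a question, options, and an answer key.

def evaluate(model, questions):
    """Score a model by exact match of its chosen option letter against the key."""
    correct = 0
    for q in questions:
        # The model sees the image plus question text and must answer A, B, C, or D.
        prediction = model(q["image"], q["question"], q["options"])
        if prediction.strip().upper() == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy usage with a stub "model" that always answers "A".
sample = [
    {"image": "img1.jpg", "question": "What body part is shown?",
     "options": ["A) Hand", "B) Foot", "C) Ear", "D) Eye"], "answer": "A"},
    {"image": "img2.jpg", "question": "What type of skin lesion is this?",
     "options": ["A) Macule", "B) Papule", "C) Plaque", "D) Nodule"], "answer": "B"},
]
always_a = lambda image, question, options: "A"
print(evaluate(always_a, sample))  # 0.5
```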

Why It Matters

ReXInTheWild addresses a critical real-world need: as telemedicine expands, AI systems that can accurately interpret patient-submitted photos could significantly improve healthcare access and efficiency. Current benchmarks like MedQA focus on text-based medical knowledge, while radiology benchmarks like CheXpert test interpretation of medical imaging—neither evaluates the specific challenges of everyday clinical photography.

Figure 2: The ReXInTheWild benchmark construction pipeline.

The poor performance of MedGemma (37%) compared to general-purpose models highlights an important insight: medical AI specialization needs to include diverse data types. A model trained exclusively on formal medical imaging (X-rays, MRIs, pathology slides) may struggle with the visual characteristics of smartphone photos, which have different lighting, composition, and quality issues.

The benchmark's error categorization provides a roadmap for improvement. Geometric errors suggest models need better training on varied photographic conditions. Medical knowledge errors indicate gaps in clinical training data. Reasoning errors point to limitations in multimodal inference capabilities. Each requires different mitigation strategies.

gentic.news Analysis

The ReXInTheWild benchmark exposes a fundamental tension in medical AI development: the trade-off between domain specialization and general visual competence. MedGemma's surprisingly poor performance (37% vs. Gemini-3's 78%) suggests that medical fine-tuning on narrow datasets can actually degrade performance on real-world tasks if those datasets don't represent the full spectrum of visual inputs clinicians encounter. This has immediate implications for healthcare organizations building or deploying diagnostic AI: simply choosing a "medical" model may be insufficient without verifying its performance on the specific types of images your workflow generates.

From a technical perspective, the error analysis provides crucial guidance for model developers. The 28% geometric error rate indicates that even state-of-the-art VLMs struggle with the photographic artifacts common in patient-submitted images: poor lighting, motion blur, unusual angles, and variable image quality. This suggests that data augmentation strategies for medical VLMs need to go beyond standard transformations and include realistic photographic degradations. Meanwhile, the 35% medical knowledge error rate—despite these models having access to extensive medical literature—points to a deeper challenge: integrating visual evidence with clinical knowledge in a probabilistically sound way.
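One way to approximate such photographic degradations (a toy pure-Python sketch; a real augmentation pipeline would use an image library and richer transforms such as blur and compression artifacts) is to jitter brightness and inject sensor-like noise:

```python
import random

def degrade(pixels, brightness_shift=40, noise_std=10, seed=0):
    """Apply a random brightness shift plus Gaussian noise to grayscale pixel
    values, mimicking the poor lighting and sensor noise of patient photos."""
    rng = random.Random(seed)
    shift = rng.uniform(-brightness_shift, brightness_shift)
    out = []
    for p in pixels:
        noisy = p + shift + rng.gauss(0, noise_std)
        out.append(max(0, min(255, round(noisy))))  # clamp to valid 8-bit range
    return out

original = [120, 130, 140, 150]
print(degrade(original))
```

Applying such transforms during fine-tuning would expose a model to the lighting and quality variation it will face at inference time, rather than only clean clinical imagery.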

Looking forward, ReXInTheWild establishes a necessary reality check for the burgeoning field of medical AI assistants. As companies race to integrate multimodal capabilities into electronic health records and telehealth platforms, this benchmark provides the first rigorous test of whether these systems can handle the messy reality of patient-generated content. The substantial gap between the best model (78%) and perfect performance (100%) indicates there's significant work ahead before AI can reliably assist with photo-based triage or diagnosis. Developers should treat ReXInTheWild as a mandatory validation step before deploying any VLM for clinical photo interpretation.

Frequently Asked Questions

What is the ReXInTheWild benchmark?

ReXInTheWild is a benchmark dataset containing 955 clinician-verified multiple-choice questions based on 484 real medical photographs sourced from biomedical literature. It tests vision-language models' ability to interpret everyday clinical photos that combine natural image understanding with medical reasoning—exactly the type of images patients send during telemedicine consultations.

Why did MedGemma perform so poorly compared to general models?

MedGemma, a model specifically fine-tuned on medical data, achieved only 37% accuracy compared to Gemini-3's 78%. The researchers suggest this is because MedGemma was trained primarily on formal medical imaging (like X-rays and pathology slides) rather than diverse photographic examples. General-purpose VLMs have seen more varied visual data during pretraining, giving them better foundational abilities to handle the lighting, angles, and quality variations in patient photos.

How can developers use ReXInTheWild to improve their models?

Developers can use the benchmark both for evaluation and for targeted improvement. The error categorization (geometric, knowledge, reasoning, and comprehension errors) provides specific areas to address. Teams can fine-tune models on the ReXInTheWild training split, use the error analysis to identify weak points, and implement targeted data augmentation—particularly adding more varied photographic conditions to training data for geometric robustness.

Is 78% accuracy sufficient for clinical use?

No, 78% accuracy is insufficient for autonomous clinical decision-making where errors could harm patients. However, it represents a strong baseline that could potentially support clinicians as a second opinion or triage tool. The benchmark shows current models have significant room for improvement before they can be trusted with high-stakes medical interpretation of patient photos.

AI Analysis

The ReXInTheWild benchmark represents a crucial step toward realistic evaluation of medical AI systems. For years, the field has relied on benchmarks that test either pure medical knowledge (MedQA) or interpretation of formal medical imaging (CheXpert), while ignoring the growing reality of patient-submitted photos in telehealth. This disconnect between benchmark tasks and real clinical workflows has allowed models to claim medical competence without demonstrating ability on actual patient data.

The most significant finding isn't that models struggle—that was expected—but the specific pattern of failures and the surprising underperformance of the specialized medical model. MedGemma's 37% accuracy, less than half of Gemini-3's 78%, should serve as a warning to the entire medical AI industry: domain specialization through fine-tuning on narrow datasets can create brittle models that fail on distribution shifts. This echoes similar findings in other AI domains where over-specialization reduces robustness.

Practically, ReXInTheWild provides exactly the kind of testbed needed to drive meaningful progress. The error categorization gives researchers clear targets: nearly one-third of errors are basic visual mistakes (geometric errors), suggesting models need better pretraining on varied photographic conditions. Another third are medical knowledge errors, indicating that even with access to medical literature, models struggle to apply that knowledge correctly to visual evidence. This points toward the need for more sophisticated multimodal reasoning architectures that can properly weigh visual and textual evidence.

For healthcare organizations evaluating AI tools, ReXInTheWild offers a critical litmus test. Any vendor claiming their VLM can interpret patient photos should be asked for their ReXInTheWild score. The substantial gap between current performance and clinical requirements (likely >95% for diagnostic use) means we're still years away from autonomous photo interpretation, but the benchmark now gives us a way to measure progress toward that goal.
Original source: arxiv.org
