ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%
Researchers have introduced ReXInTheWild, a new benchmark designed to evaluate how well vision-language models (VLMs) can interpret everyday medical photographs—the kind increasingly used in telemedicine and online health consultations. Published on arXiv, the benchmark addresses a critical gap: general-purpose VLMs excel at recognizing objects in natural images, and specialized medical models are trained on radiology scans or pathology slides, but no comprehensive test existed for casual clinical photos, whose interpretation requires both fine-grained visual understanding and domain-specific medical reasoning.
The benchmark consists of 955 multiple-choice questions, each verified by clinicians, spanning seven clinical topics across 484 photographs sourced directly from biomedical literature. These aren't idealized textbook images but real-world photos showing conditions like skin lesions, wound infections, eye abnormalities, and oral health issues—exactly what patients might send to their doctors via telehealth platforms.
What the Researchers Built
ReXInTheWild (Real-world eXamples In The Wild) is specifically designed to test multimodal models at the intersection of natural image understanding and clinical reasoning. The researchers collected photographs from biomedical publications that represent actual clinical scenarios, then developed questions that require:
- Visual recognition: Identifying relevant anatomical structures, abnormalities, or clinical signs
- Medical knowledge: Understanding what those visual findings mean diagnostically
- Clinical reasoning: Making appropriate inferences about severity, next steps, or differential diagnoses
The 955 questions are distributed across seven clinical domains: dermatology (28%), ophthalmology (20%), dentistry/oral health (17%), wound care (13%), gynecology (10%), urology (7%), and gastroenterology (5%). Each question follows a multiple-choice format with four options, and all questions and answers were validated by board-certified physicians to ensure clinical accuracy.
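To make that format concrete, a single benchmark item can be pictured as a small record like the sketch below. The class name, field names, and example values are hypothetical illustrations of the properties described above (an image, a clinical domain, a question, four options, one verified answer)—not the benchmark's actual release schema.

```python
from dataclasses import dataclass

@dataclass
class ReXInTheWildItem:
    """Hypothetical record for one clinician-verified question.

    Field names are illustrative only; they mirror the properties the paper
    describes (image, domain, four options, one agreed answer), not the
    released file format.
    """
    image_path: str     # photograph sourced from biomedical literature
    domain: str         # e.g. "dermatology", "ophthalmology", "wound care"
    question: str       # clinician-written question about the photo
    options: list[str]  # exactly four answer choices
    answer_index: int   # index (0-3) of the clinician-agreed correct option

example_item = ReXInTheWildItem(
    image_path="images/derm_0001.jpg",
    domain="dermatology",
    question="What type of skin lesion is shown?",
    options=["Macule", "Vesicle", "Plaque", "Pustule"],
    answer_index=2,
)
```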
Key Results
The benchmark reveals substantial performance variation among leading multimodal models: Gemini-3 led the evaluated systems at 78% accuracy, with the other general-purpose models scoring below it.

Perhaps most surprisingly, MedGemma—a model specifically fine-tuned on medical data—performed significantly worse than all general-purpose models, achieving only 37% accuracy. This suggests that medical specialization alone doesn't guarantee competence with real-world clinical photographs if the training data doesn't include enough examples of everyday medical photography.
The researchers conducted a systematic error analysis that revealed four distinct failure categories (a brief tallying sketch follows the list):
- Geometric/visual errors (28%): Basic mistakes in recognizing objects, shapes, or spatial relationships in the image
- Medical knowledge errors (35%): Incorrect application of medical facts or concepts despite correct visual understanding
- Reasoning errors (22%): Flawed logical inference even with correct visual and factual information
- Question comprehension errors (15%): Misunderstanding what the question is asking
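Reproducing this kind of breakdown from annotated failures is straightforward. The sketch below assumes each wrong answer has been hand-labeled with one of the paper's four categories; the input format is an assumption about the analysis workflow, not released code.

```python
from collections import Counter

# Hypothetical input: one category label per incorrectly answered question,
# assigned by a human reviewer using the paper's four failure categories.
error_labels = [
    "medical knowledge", "geometric/visual", "reasoning",
    "medical knowledge", "question comprehension", "geometric/visual",
    # ... one entry per wrong answer in the evaluation run
]

counts = Counter(error_labels)
total = sum(counts.values())
for category, count in counts.most_common():
    print(f"{category:>22}: {count:4d}  ({100 * count / total:.0f}% of errors)")
```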
How It Works
ReXInTheWild's construction followed a rigorous methodology to ensure clinical relevance and benchmarking utility:

Data Collection: The team sourced 484 photographs from peer-reviewed biomedical literature, focusing on images that represent real clinical scenarios rather than idealized textbook examples. These images were selected to cover a range of lighting conditions, angles, and photographic qualities that mirror what patients actually capture.
Question Generation: For each image, clinicians generated questions that test different levels of understanding:
- Level 1: Basic visual recognition ("What body part is shown?")
- Level 2: Pattern recognition ("What type of skin lesion is this?")
- Level 3: Diagnostic reasoning ("What is the most likely diagnosis?")
- Level 4: Management reasoning ("What would be the appropriate next step?")
Validation Process: All questions underwent multiple rounds of clinician review, with disagreements resolved through consensus discussion. The final benchmark includes only questions where clinicians unanimously agreed on the correct answer.
Evaluation Protocol: Models are presented with the image and question, then must select from four answer choices. The benchmark uses exact match accuracy as the primary metric, with additional analysis of error patterns across question types and clinical domains.
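A minimal harness for this protocol might look like the following sketch. The `model.answer(...)` call and the item fields are assumptions standing in for whatever VLM API and data loader a team actually uses; the benchmark itself fixes only the task format and the exact-match metric.

```python
from collections import defaultdict

def evaluate(model, items):
    """Exact-match accuracy over four-option multiple-choice items.

    `model` is assumed to expose answer(image_path, question, options) -> int
    (the index of the chosen option); `items` is an iterable of records with
    image_path, domain, question, options, and answer_index fields, as in the
    hypothetical schema sketched earlier. Both interfaces are assumptions.
    """
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for item in items:
        predicted = model.answer(item.image_path, item.question, item.options)
        per_domain[item.domain][0] += int(predicted == item.answer_index)
        per_domain[item.domain][1] += 1

    correct = sum(c for c, _ in per_domain.values())
    total = sum(t for _, t in per_domain.values())
    domain_accuracy = {d: c / t for d, (c, t) in per_domain.items()}
    return (correct / total if total else 0.0), domain_accuracy
```

Returning a per-domain breakdown alongside overall accuracy supports the same kind of analysis across clinical domains that the paper reports.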
Why It Matters
ReXInTheWild addresses a critical real-world need: as telemedicine expands, AI systems that can accurately interpret patient-submitted photos could significantly improve healthcare access and efficiency. Current benchmarks like MedQA focus on text-based medical knowledge, while radiology benchmarks like CheXpert test interpretation of medical imaging—neither evaluates the specific challenges of everyday clinical photography.

The poor performance of MedGemma (37%) compared to general-purpose models highlights an important insight: medical AI specialization needs to include diverse data types. A model trained exclusively on formal medical imaging (X-rays, MRIs, pathology slides) may struggle with the visual characteristics of smartphone photos, which have different lighting, composition, and quality issues.
The benchmark's error categorization provides a roadmap for improvement. Geometric errors suggest models need better training on varied photographic conditions. Medical knowledge errors indicate gaps in clinical training data. Reasoning errors point to limitations in multimodal inference capabilities. Each requires different mitigation strategies.
Agentic.news Analysis
The ReXInTheWild benchmark exposes a fundamental tension in medical AI development: the trade-off between domain specialization and general visual competence. MedGemma's surprisingly poor performance (37% vs. Gemini-3's 78%) suggests that medical fine-tuning on narrow datasets can actually degrade performance on real-world tasks if those datasets don't represent the full spectrum of visual inputs clinicians encounter. This has immediate implications for healthcare organizations building or deploying diagnostic AI: simply choosing a "medical" model may be insufficient without verifying its performance on the specific types of images your workflow generates.
From a technical perspective, the error analysis provides crucial guidance for model developers. The 28% geometric error rate indicates that even state-of-the-art VLMs struggle with the photographic artifacts common in patient-submitted images: poor lighting, motion blur, unusual angles, and variable image quality. This suggests that data augmentation strategies for medical VLMs need to go beyond standard transformations and include realistic photographic degradations. Meanwhile, the 35% medical knowledge error rate—despite these models having access to extensive medical literature—points to a deeper challenge: integrating visual evidence with clinical knowledge in a probabilistically sound way.
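If that diagnosis is right, one practical response is to augment medical training images with realistic capture artifacts instead of only the standard flips and crops. The sketch below uses torchvision to approximate poor lighting, off-angle shots, variable framing, and blur; the specific transforms and parameter ranges are illustrative choices, not a recipe from the paper.

```python
from torchvision import transforms

# Sketch of "in the wild" augmentation: simulate uneven lighting, blur,
# odd camera angles, and variable framing typical of patient smartphone
# photos. Parameter ranges are illustrative, not tuned values.
wild_photo_augment = transforms.Compose([
    transforms.RandomResizedCrop(448, scale=(0.6, 1.0)),        # variable framing / distance
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),  # off-angle shots
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.3, hue=0.05),            # poor or mixed lighting
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=7, sigma=(0.5, 2.0))],
        p=0.3),                                                  # motion or focus blur
    transforms.ToTensor(),
])
```

A pipeline like this targets only the geometric/visual error category; the knowledge and reasoning failure modes call for different interventions, such as broader clinical training data and stronger multimodal inference.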
Looking forward, ReXInTheWild establishes a necessary reality check for the burgeoning field of medical AI assistants. As companies race to integrate multimodal capabilities into electronic health records and telehealth platforms, this benchmark provides the first rigorous test of whether these systems can handle the messy reality of patient-generated content. The substantial gap between the best model (78%) and perfect performance (100%) indicates there's significant work ahead before AI can reliably assist with photo-based triage or diagnosis. Developers should treat ReXInTheWild as a mandatory validation step before deploying any VLM for clinical photo interpretation.
Frequently Asked Questions
What is the ReXInTheWild benchmark?
ReXInTheWild is a benchmark dataset containing 955 clinician-verified multiple-choice questions based on 484 real medical photographs sourced from biomedical literature. It tests vision-language models' ability to interpret everyday clinical photos that combine natural image understanding with medical reasoning—exactly the type of images patients send during telemedicine consultations.
Why did MedGemma perform so poorly compared to general models?
MedGemma, a model specifically fine-tuned on medical data, achieved only 37% accuracy compared to Gemini-3's 78%. The researchers suggest this is because MedGemma was trained primarily on formal medical imaging (like X-rays and pathology slides) rather than diverse photographic examples. General-purpose VLMs have seen more varied visual data during pretraining, giving them better foundational abilities to handle the lighting, angles, and quality variations in patient photos.
How can developers use ReXInTheWild to improve their models?
Developers can use the benchmark both for evaluation and for targeted improvement. The error categorization (geometric, knowledge, reasoning, and comprehension errors) provides specific areas to address. Teams can fine-tune models on the ReXInTheWild training split, use the error analysis to identify weak points, and implement targeted data augmentation—particularly adding more varied photographic conditions to training data for geometric robustness.
Is 78% accuracy sufficient for clinical use?
No, 78% accuracy is insufficient for autonomous clinical decision-making where errors could harm patients. However, it represents a strong baseline that could potentially support clinicians as a second opinion or triage tool. The benchmark shows current models have significant room for improvement before they can be trusted with high-stakes medical interpretation of patient photos.