ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%

Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.

gentic.news Editorial·1d ago·7 min read·via arxiv_cv


Researchers have introduced ReXInTheWild, a new benchmark designed to evaluate how well vision-language models (VLMs) can interpret everyday medical photographs—the kind increasingly used in telemedicine and online health consultations. Published on arXiv, the benchmark addresses a critical gap: while general-purpose VLMs excel at recognizing objects in natural images, and specialized medical models are trained on radiology scans or pathology slides, no comprehensive test existed for analyzing casual clinical photos that require both fine-grained visual understanding and domain-specific medical reasoning.

The benchmark consists of 955 multiple-choice questions, each verified by clinicians, spanning seven clinical topics across 484 photographs sourced directly from biomedical literature. These aren't curated medical images but real-world photos showing conditions like skin lesions, wound infections, eye abnormalities, and oral health issues—exactly what patients might send to their doctors via telehealth platforms.

What the Researchers Built

ReXInTheWild (Real-world eXamples In The Wild) is specifically designed to test multimodal models at the intersection of natural image understanding and clinical reasoning. The researchers collected photographs from biomedical publications that represent actual clinical scenarios, then developed questions that require:

  1. Visual recognition: Identifying relevant anatomical structures, abnormalities, or clinical signs
  2. Medical knowledge: Understanding what those visual findings mean diagnostically
  3. Clinical reasoning: Making appropriate inferences about severity, next steps, or differential diagnoses

The 955 questions are distributed across seven clinical domains: dermatology (28%), ophthalmology (20%), dentistry/oral health (17%), wound care (13%), gynecology (10%), urology (7%), and gastroenterology (5%). Each question follows a multiple-choice format with four options, and all questions and answers were validated by board-certified physicians to ensure clinical accuracy.

Key Results

The benchmark reveals substantial performance variation among leading multimodal models:

Figure 3: Large general-purpose models, especially Gemini-3, outperformed MedGemma, a smaller medical MLLM.

  • Gemini-3: 78% (best-performing general-purpose VLM)
  • Claude Opus 4.5: 72% (strong performance, six percentage points behind the leader)
  • GPT-5: 68% (third among general-purpose models)
  • MedGemma: 37% (the medical specialist model performed worst)

Perhaps most surprisingly, MedGemma—a model specifically fine-tuned on medical data—performed significantly worse than all general-purpose models, achieving only 37% accuracy. This suggests that medical specialization alone doesn't guarantee competence with real-world clinical photographs if the training data doesn't include enough examples of everyday medical photography.

The researchers conducted a systematic error analysis that revealed four distinct failure categories:

  1. Geometric/visual errors (28%): Basic mistakes in recognizing objects, shapes, or spatial relationships in the image
  2. Medical knowledge errors (35%): Incorrect application of medical facts or concepts despite correct visual understanding
  3. Reasoning errors (22%): Flawed logical inference even with correct visual and factual information
  4. Question comprehension errors (15%): Misunderstanding what the question is asking
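As a rough sketch of how such a breakdown is produced (the per-question label format here is a hypothetical illustration, not the paper's released annotations), the four categories can be tallied from labeled failure cases:

```python
from collections import Counter

# Hypothetical per-error category labels; shares here mirror the reported percentages.
error_labels = (
    ["geometric"] * 28 + ["knowledge"] * 35 +
    ["reasoning"] * 22 + ["comprehension"] * 15
)

def error_breakdown(labels):
    """Return each error category's share of all errors, as a rounded percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}

print(error_breakdown(error_labels))
# {'geometric': 28, 'knowledge': 35, 'reasoning': 22, 'comprehension': 15}
```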

How It Works

ReXInTheWild's construction followed a rigorous methodology to ensure clinical relevance and benchmarking utility:

Example question-answer pairs from ReXInTheWild (figure panel b).

Data Collection: The team sourced 484 photographs from peer-reviewed biomedical literature, focusing on images that represent real clinical scenarios rather than idealized textbook examples. These images were selected to cover a range of lighting conditions, angles, and photographic qualities that mirror what patients actually capture.

Question Generation: For each image, clinicians generated questions that test different levels of understanding:

  • Level 1: Basic visual recognition ("What body part is shown?")
  • Level 2: Pattern recognition ("What type of skin lesion is this?")
  • Level 3: Diagnostic reasoning ("What is the most likely diagnosis?")
  • Level 4: Management reasoning ("What would be the appropriate next step?")

Validation Process: All questions underwent multiple rounds of clinician review, with disagreements resolved through consensus discussion. The final benchmark includes only questions where clinicians unanimously agreed on the correct answer.

Evaluation Protocol: Models are presented with the image and question, then must select from four answer choices. The benchmark uses exact match accuracy as the primary metric, with additional analysis of error patterns across question types and clinical domains.
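A minimal sketch of this protocol (the question schema and the `model` callable are illustrative assumptions, not the authors' released evaluation harness):

```python
# Exact-match evaluation loop for a four-option multiple-choice benchmark.
# Each question record pairs an image with a question, options, and an answer key.

def evaluate(model, questions):
    """Score a model by exact match of its chosen option letter against the key."""
    correct = 0
    for q in questions:
        # The model sees the image plus question text and must answer A, B, C, or D.
        prediction = model(q["image"], q["question"], q["options"])
        if prediction.strip().upper() == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy usage with a stub "model" that always answers "A".
sample = [
    {"image": "img1.jpg", "question": "What body part is shown?",
     "options": ["A) Hand", "B) Foot", "C) Ear", "D) Eye"], "answer": "A"},
    {"image": "img2.jpg", "question": "What type of skin lesion is this?",
     "options": ["A) Macule", "B) Papule", "C) Plaque", "D) Nodule"], "answer": "B"},
]
always_a = lambda image, question, options: "A"
print(evaluate(always_a, sample))  # 0.5
```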

Why It Matters

ReXInTheWild addresses a critical real-world need: as telemedicine expands, AI systems that can accurately interpret patient-submitted photos could significantly improve healthcare access and efficiency. Current benchmarks like MedQA focus on text-based medical knowledge, while radiology benchmarks like CheXpert test interpretation of medical imaging—neither evaluates the specific challenges of everyday clinical photography.

Figure 2: The ReXInTheWild benchmark construction pipeline.

The poor performance of MedGemma (37%) compared to general-purpose models highlights an important insight: medical AI specialization needs to include diverse data types. A model trained exclusively on formal medical imaging (X-rays, MRIs, pathology slides) may struggle with the visual characteristics of smartphone photos, which have different lighting, composition, and quality issues.

The benchmark's error categorization provides a roadmap for improvement. Geometric errors suggest models need better training on varied photographic conditions. Medical knowledge errors indicate gaps in clinical training data. Reasoning errors point to limitations in multimodal inference capabilities. Each requires different mitigation strategies.

gentic.news Analysis

The ReXInTheWild benchmark exposes a fundamental tension in medical AI development: the trade-off between domain specialization and general visual competence. MedGemma's surprisingly poor performance (37% vs. Gemini-3's 78%) suggests that medical fine-tuning on narrow datasets can actually degrade performance on real-world tasks if those datasets don't represent the full spectrum of visual inputs clinicians encounter. This has immediate implications for healthcare organizations building or deploying diagnostic AI: simply choosing a "medical" model may be insufficient without verifying its performance on the specific types of images your workflow generates.

From a technical perspective, the error analysis provides crucial guidance for model developers. The 28% geometric error rate indicates that even state-of-the-art VLMs struggle with the photographic artifacts common in patient-submitted images: poor lighting, motion blur, unusual angles, and variable image quality. This suggests that data augmentation strategies for medical VLMs need to go beyond standard transformations and include realistic photographic degradations. Meanwhile, the 35% medical knowledge error rate—despite these models having access to extensive medical literature—points to a deeper challenge: integrating visual evidence with clinical knowledge in a probabilistically sound way.
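One way to approximate such photographic degradations (a toy pure-Python sketch; a real augmentation pipeline would use an image library and richer transforms such as blur and compression artifacts) is to jitter brightness and inject sensor-like noise:

```python
import random

def degrade(pixels, brightness_shift=40, noise_std=10, seed=0):
    """Apply a random brightness shift plus Gaussian noise to grayscale pixel
    values, mimicking the poor lighting and sensor noise of patient photos."""
    rng = random.Random(seed)
    shift = rng.uniform(-brightness_shift, brightness_shift)
    out = []
    for p in pixels:
        noisy = p + shift + rng.gauss(0, noise_std)
        out.append(max(0, min(255, round(noisy))))  # clamp to valid 8-bit range
    return out

original = [120, 130, 140, 150]
print(degrade(original))
```

Applying such transforms during fine-tuning would expose a model to the lighting and quality variation it will face at inference time, rather than only clean clinical imagery.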

Looking forward, ReXInTheWild establishes a necessary reality check for the burgeoning field of medical AI assistants. As companies race to integrate multimodal capabilities into electronic health records and telehealth platforms, this benchmark provides the first rigorous test of whether these systems can handle the messy reality of patient-generated content. The substantial gap between the best model (78%) and perfect performance (100%) indicates there's significant work ahead before AI can reliably assist with photo-based triage or diagnosis. Developers should treat ReXInTheWild as a mandatory validation step before deploying any VLM for clinical photo interpretation.

Frequently Asked Questions

What is the ReXInTheWild benchmark?

ReXInTheWild is a benchmark dataset containing 955 clinician-verified multiple-choice questions based on 484 real medical photographs sourced from biomedical literature. It tests vision-language models' ability to interpret everyday clinical photos that combine natural image understanding with medical reasoning—exactly the type of images patients send during telemedicine consultations.

Why did MedGemma perform so poorly compared to general models?

MedGemma, a model specifically fine-tuned on medical data, achieved only 37% accuracy compared to Gemini-3's 78%. The researchers suggest this is because MedGemma was trained primarily on formal medical imaging (like X-rays and pathology slides) rather than diverse photographic examples. General-purpose VLMs have seen more varied visual data during pretraining, giving them better foundational abilities to handle the lighting, angles, and quality variations in patient photos.

How can developers use ReXInTheWild to improve their models?

Developers can use the benchmark both for evaluation and for targeted improvement. The error categorization (geometric, knowledge, reasoning, and comprehension errors) provides specific areas to address. Teams can fine-tune models on the ReXInTheWild training split, use the error analysis to identify weak points, and implement targeted data augmentation—particularly adding more varied photographic conditions to training data for geometric robustness.

Is 78% accuracy sufficient for clinical use?

No, 78% accuracy is insufficient for autonomous clinical decision-making where errors could harm patients. However, it represents a strong baseline that could potentially support clinicians as a second opinion or triage tool. The benchmark shows current models have significant room for improvement before they can be trusted with high-stakes medical interpretation of patient photos.

AI Analysis

The ReXInTheWild benchmark represents a crucial step toward realistic evaluation of medical AI systems. For years, the field has relied on benchmarks that test either pure medical knowledge (MedQA) or interpretation of formal medical imaging (CheXpert), while ignoring the growing reality of patient-submitted photos in telehealth. This disconnect between benchmark tasks and real clinical workflows has allowed models to claim medical competence without demonstrating ability on actual patient data.

The most significant finding isn't that models struggle—that was expected—but the specific pattern of failures and the surprising underperformance of the specialized medical model. MedGemma's 37% accuracy, less than half of Gemini-3's 78%, should serve as a warning to the entire medical AI industry: domain specialization through fine-tuning on narrow datasets can create brittle models that fail on distribution shifts. This echoes similar findings in other AI domains where over-specialization reduces robustness.

Practically, ReXInTheWild provides exactly the kind of testbed needed to drive meaningful progress. The error categorization gives researchers clear targets: nearly one-third of errors are basic visual mistakes (geometric errors), suggesting models need better pretraining on varied photographic conditions. Another third are medical knowledge errors, indicating that even with access to medical literature, models struggle to apply that knowledge correctly to visual evidence. This points toward the need for more sophisticated multimodal reasoning architectures that can properly weigh visual and textual evidence.

For healthcare organizations evaluating AI tools, ReXInTheWild offers a critical litmus test. Any vendor claiming their VLM can interpret patient photos should be asked for their ReXInTheWild score. The substantial gap between current performance and clinical requirements (likely >95% for diagnostic use) means we're still years away from autonomous photo interpretation, but the benchmark now gives us a way to measure progress toward that goal.
Original source: arxiv.org
