Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
A new benchmark dataset called Gastric-X has been introduced to systematically evaluate and advance vision-language models (VLMs) for clinical applications in gastric cancer analysis. Published on arXiv, the dataset provides 1,700 comprehensive clinical cases designed to mirror real diagnostic workflows, moving beyond simple image-text pairs to incorporate the multimodal evidence physicians actually use.
What the Researchers Built: A Clinically Realistic Multimodal Dataset
The core contribution is Gastric-X itself—a structured collection that captures multiple phases of gastric cancer diagnosis. Each of the 1,700 cases includes:
- Paired CT Scans: Both resting and dynamic (contrast-enhanced) CT scans, providing spatial and functional tumor information.
- Endoscopic Image: A direct view of the gastric mucosa.
- Structured Biochemical Indicators: A set of lab values (e.g., tumor markers, blood counts) as structured data.
- Expert-Authored Diagnostic Notes: Textual reports written by clinicians, summarizing findings and reasoning.
- Bounding Box Annotations: Precise localization of tumor regions in the imaging data.
This combination is designed to reflect the "evidential reasoning" of a physician, who synthesizes imaging, lab work, and clinical notes to form a diagnosis, rather than relying on a single data modality.
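For concreteness, the contents of a single case could be modeled as a simple record. This is a minimal sketch: the field names, file layout, and example values below are hypothetical stand-ins, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class GastricXCase:
    """Hypothetical record mirroring the modalities described for one Gastric-X case."""
    case_id: str
    ct_resting: str                    # path to the resting CT volume (assumed file layout)
    ct_dynamic: str                    # path to the contrast-enhanced dynamic CT volume
    endoscopy_image: str               # path to the endoscopic image
    biochem: dict[str, float] = field(default_factory=dict)  # structured lab values
    diagnostic_note: str = ""          # expert-authored free-text report
    tumor_boxes: list[tuple[int, int, int, int]] = field(default_factory=list)  # (x, y, w, h)

# Example instantiation with invented values, for illustration only.
case = GastricXCase(
    case_id="GX-0001",
    ct_resting="ct/GX-0001_rest.nii.gz",
    ct_dynamic="ct/GX-0001_dyn.nii.gz",
    endoscopy_image="endo/GX-0001.png",
    biochem={"CA19-9": 41.2, "HGB": 9.8},
    diagnostic_note="Irregular wall thickening of the gastric antrum.",
    tumor_boxes=[(120, 88, 64, 52)],
)
```

A structure like this makes the "evidential reasoning" framing concrete: a model must consume every field of the record, not just the images.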
The Five Core Evaluation Tasks
The benchmark defines five tasks to systematically probe VLM capabilities across a simulated clinical workflow:

- Visual Question Answering (VQA): Answering diagnostic questions based on provided images and data.
- Report Generation: Generating a diagnostic summary from the multimodal inputs.
- Cross-Modal Retrieval: Retrieving relevant cases or evidence across different data types (e.g., finding images with similar lab profiles).
- Disease Classification: Categorizing the case (e.g., cancer stage, subtype).
- Lesion Localization: Identifying and localizing tumors within the images.
These tasks progress from basic understanding (VQA) to complex synthesis (report generation) and decision support (retrieval, classification).
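The five-task layout naturally suggests a benchmark harness that dispatches a model over each task. The sketch below is an assumption about how such a harness might look; the metric names are plausible choices for each task type, not the paper's actual evaluation protocol.

```python
from enum import Enum

class Task(Enum):
    """The five Gastric-X evaluation tasks."""
    VQA = "visual_question_answering"
    REPORT_GENERATION = "report_generation"
    CROSS_MODAL_RETRIEVAL = "cross_modal_retrieval"
    DISEASE_CLASSIFICATION = "disease_classification"
    LESION_LOCALIZATION = "lesion_localization"

# Hypothetical metric choices per task; the paper may specify different ones.
METRICS = {
    Task.VQA: ["accuracy"],
    Task.REPORT_GENERATION: ["BLEU", "ROUGE-L"],
    Task.CROSS_MODAL_RETRIEVAL: ["recall@k"],
    Task.DISEASE_CLASSIFICATION: ["accuracy", "macro-F1"],
    Task.LESION_LOCALIZATION: ["IoU", "mAP"],
}

def evaluate(model, task: Task, cases):
    """Run one task over a list of cases; `model` is any callable(task, case)."""
    preds = [model(task, c) for c in cases]
    return {"task": task.value, "n": len(preds), "metrics": METRICS[task]}
```

Usage is a matter of plugging a VLM wrapper in as `model`, e.g. `evaluate(my_vlm, Task.VQA, cases)`.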
The Central Research Question: Can VLMs Correlate Signals Like a Doctor?
The paper positions Gastric-X not just as a performance leaderboard, but as a tool to investigate a fundamental question: Can current VLMs meaningfully correlate discrete biochemical signals with spatial tumor features and unstructured textual reports?

This gets at the heart of clinical reasoning. For example, an elevated CA 19-9 tumor marker should be associated with specific imaging characteristics described in a radiology note. The benchmark is designed to test whether models learn these nuanced, cross-modal associations or merely perform surface-level pattern matching.
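One common way to test such cross-modal associations is to project each modality into a shared embedding space and measure agreement. The toy sketch below illustrates the idea with cosine similarity over hand-written vectors; in a real VLM the embeddings would come from learned lab-panel and imaging encoders, which this example only stands in for.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for learned embeddings in a shared space (invented numbers).
lab_embedding   = [0.9, 0.1, 0.4]   # e.g. encodes an elevated CA 19-9 panel
image_embedding = [0.8, 0.2, 0.5]   # e.g. encodes CT features of a suspicious lesion

score = cosine(lab_embedding, image_embedding)
print(round(score, 3))  # a high score suggests the two modalities agree
```

A model doing surface-level pattern matching would score high-similarity pairs no better than chance; one that has learned the clinical correlations should not.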
Availability and Intended Use
Gastric-X is presented as a resource to "inspire the development of next-generation medical VLMs." Since the paper is an arXiv preprint, the dataset and evaluation code are likely intended for public release to facilitate research, in line with arXiv's role as an open-access repository for scientific communication prior to formal peer review.

gentic.news Analysis
Gastric-X represents a necessary and sophisticated escalation in medical AI benchmarking. For years, the field has relied on narrow tasks like labeling a single X-ray or classifying a skin lesion from a photo. This dataset correctly identifies that real diagnosis is a multimodal retrieval and reasoning problem. A physician doesn't just look at a scan; they contextualize it with lab results, prior reports, and clinical guidelines. By forcing models to handle paired CT phases, endoscopy, structured lab values, and expert text simultaneously, Gastric-X exposes a critical weakness in current VLMs: they are predominantly trained on internet-scale image-text pairs, which lack the structured, temporally aligned, and domain-specific correlations found in medicine.
The inclusion of structured biochemical indicators is particularly astute. Most VLMs treat all input as pixels or tokens, struggling with discrete, numerical data that has precise clinical meaning. A model that can't reason that a plummeting hemoglobin might correlate with a bleeding tumor seen on CT has failed a basic clinical logic test. This benchmark will likely reveal that current state-of-the-art general VLMs perform poorly, creating a clear market gap for medically pretrained or hybrid architectures that can embed lab values and imaging features into a joint, semantically meaningful space.
Furthermore, Gastric-X subtly challenges the prevailing trend of simply scaling up model parameters and training data from the web. It suggests that for high-stakes domains like oncology, curated, task-specific multimodal reasoning is the next frontier. Success here won't come from a bigger GPT, but from architectures that can emulate the hypothesis-driven, evidence-weaving process captured in this dataset's design. It sets a new standard for what a "comprehensive" medical AI benchmark should look like.
Frequently Asked Questions
What is the Gastric-X dataset?
Gastric-X is a large-scale multimodal benchmark dataset for gastric cancer analysis. It contains 1,700 clinical cases, each comprising paired CT scans (resting and dynamic), an endoscopic image, structured biochemical lab values, expert-authored diagnostic notes, and annotations of tumor regions. It is designed to evaluate AI models on tasks that simulate a real clinical diagnostic workflow.
What AI models is Gastric-X designed to evaluate?
The benchmark is specifically designed to evaluate Vision-Language Models (VLMs). These are AI models that can process and understand both visual data (like medical images) and textual data (like clinical notes). The dataset tests whether these models can perform integrated, clinically-relevant tasks such as answering questions based on images and data, generating reports, and retrieving related information across different types of medical data.
How is Gastric-X different from other medical AI datasets?
Unlike many medical datasets that focus on a single type of data (e.g., only CT images or only pathology reports), Gastric-X is multimodal and multi-phase. It integrates multiple, complementary data sources from a single patient case that a doctor would actually review. This includes different types of scans, numerical lab results, and textual notes, making it a more realistic and challenging testbed for AI systems aiming to assist in complex diagnosis.
Where can I find the Gastric-X dataset and paper?
The research paper introducing Gastric-X is available on arXiv, an open-access repository for scientific preprints. The paper is titled "Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis" and its identifier is arXiv:2603.19516v1. The dataset itself is expected to be released by the authors to accompany the paper, following common practice in the research community.
