Zhang et al. tested four fusion strategies on 7,022 patients and found no single winner. The June 2026 arXiv paper shows contrastive alignment with CLMBR dominates for PE mortality, while cross-attention and co-attention variants split leadership on CVD outcomes.
Key facts
- 4 fusion strategies tested: late, contrastive, cross-attention, co-attention
- 7,022 total patients across PE and CVD cohorts
- Fusion improves concordance index by 1.5-5.4% over unimodal
- CLMBR contrastive fusion best for PE mortality
- Cross-attention and co-attention split leadership for CVD
A new paper from Zhemin Zhang, Weijie Chen, David Le, and colleagues, posted on arXiv on June 13, 2026, systematically compares four multimodal fusion strategies for time-to-event (TTE) prediction using CT imaging and longitudinal EHR data. According to the arXiv preprint, the work evaluates them on two clinically distinct tasks: pulmonary embolism (PE) mortality (N=3,099 train; 1,098 internal; 435 external) and cardiovascular disease (CVD) outcomes (N=2,951 train; 837 internal; 682 external).
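Time-to-event models are usually scored with the concordance index, the metric the paper reports. As a rough illustration of what that number measures, here is a minimal sketch of Harrell's C-index for right-censored data. Whether the authors use this exact variant is an assumption, and the function name, tie handling, and toy data are illustrative rather than taken from the paper.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times  : observed follow-up time per patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : model risk score per patient (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Pairs with tied times are skipped in this simplified version.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # the earlier event received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical, not from the paper).
times = [5, 8, 12, 20]
events = [1, 0, 1, 0]
risks = [0.9, 0.4, 0.1, 0.6]
print(concordance_index(times, events, risks))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why gains of a few percentage points can be meaningful.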
The four fusion strategies
The framework encodes CT and EHR modalities independently using domain-specific foundation models, then combines them in a shared latent space using one of four strategies: late fusion, contrastive alignment, cross-attention, or co-attention. The authors report that fusion consistently improves the concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. However, performance varies sharply by task. Contrastive multimodal fusion with CLMBR representations gave the most consistent and statistically robust improvements for PE mortality prediction. For MACE (major adverse cardiovascular events), cross-attention with one-hot encoding achieved the highest internal performance, while image-guided co-attention led on external test sets.
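To make the differences between these mechanisms concrete, the following is a minimal sketch of three of them in plain Python. It is not the authors' implementation: the function names, the single-head attention without learned projections, the one-directional InfoNCE loss, and the toy vectors are all assumptions made for illustration. Co-attention, which in many formulations applies attention in both directions, is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(ct_risk, ehr_risk, w_ct=0.5):
    """Late fusion: each modality predicts alone; only the risk scores are mixed."""
    return w_ct * ct_risk + (1.0 - w_ct) * ehr_risk


def cross_attention(queries, keys_values):
    """Cross-attention: tokens of one modality attend over tokens of the other.

    Single head, no learned projections (a simplification for this sketch).
    """
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        # Each output token is a weighted sum of the other modality's tokens.
        fused.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                      for d in range(dim)])
    return fused


def info_nce(ct_embs, ehr_embs, temperature=0.1):
    """Contrastive alignment (InfoNCE, CT-to-EHR direction only).

    Row i of each list is assumed to belong to the same patient; the loss is
    low when each CT embedding is closest to its own patient's EHR embedding.
    """
    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    ct = [normalize(v) for v in ct_embs]
    ehr = [normalize(v) for v in ehr_embs]
    loss = 0.0
    for i, c in enumerate(ct):
        probs = softmax([dot(c, e) / temperature for e in ehr])
        loss -= math.log(probs[i])
    return loss / len(ct)


# Toy embeddings (hypothetical, not from the paper).
ct_tokens = [[1.0, 0.0], [0.0, 1.0]]
ehr_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused_tokens = cross_attention(ct_tokens, ehr_tokens)   # 2 tokens of dimension 2
aligned = info_nce(ct_tokens, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(ct_tokens, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

The practical distinction is where the modalities meet: late fusion mixes only final predictions, contrastive alignment shapes the embedding space before any prediction head, and attention-based fusion lets one modality reweight the other token by token.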
No universal fusion recipe
The paper's central finding is that the multimodal fusion strategy must be task-aware: no single approach fits every task. Modality imbalance, where one modality dominates the predictive signal, shifts which fusion mechanism works best. The authors describe their study as the first systematic analysis of this behavior in TTE prediction and argue that task-aware alignment is a necessary design principle for robust generalization and scalable clinical deployment.
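In practice, one simple way to act on a task-aware principle is to choose the fusion strategy separately for each task based on held-out performance. The sketch below assumes a table of validation concordance indices per task and strategy. The function name, task labels, and every number are placeholders for illustration only and are not results from the paper.

```python
def select_fusion_per_task(val_scores):
    """For each task, return the fusion strategy with the highest validation C-index."""
    return {
        task: max(by_strategy, key=by_strategy.get)
        for task, by_strategy in val_scores.items()
    }


# Placeholder numbers for illustration only; they are NOT results from the paper.
val_scores = {
    "task_a": {"late": 0.70, "contrastive": 0.74,
               "cross_attention": 0.72, "co_attention": 0.71},
    "task_b": {"late": 0.68, "contrastive": 0.69,
               "cross_attention": 0.73, "co_attention": 0.72},
}

print(select_fusion_per_task(val_scores))
# {'task_a': 'contrastive', 'task_b': 'cross_attention'}
```

Selecting on an internal validation split alone may not carry over to external sites, which is consistent with the paper's observation that the leading strategy for MACE differed between internal and external test sets.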
Related work and context
This work follows a pattern in recent clinical AI research: the June 2026 arXiv paper on strategic attack timing similarly tested multiple strategies across tasks and found no universal winner. The broader lesson is that foundation model alignment for healthcare requires task-specific tuning, not a single architecture.
Key takeaways
- Zhang et al. tested four fusion strategies on 7K+ patients and found no universal best.
- Contrastive alignment with CLMBR wins for PE mortality; cross-attention and co-attention split for CVD.
What to watch
Watch for follow-up work testing these fusion strategies on additional tasks (e.g., sepsis, cancer survival) and modalities (e.g., genomics, pathology slides). If the pattern holds, clinical AI deployment will require task-specific fusion selection rather than a single architecture, which could add complexity to regulatory approval.

Source: arxiv.org









