multi modal

30 articles about multi modal in AI news

SingGuard: Runtime Guardrails for Multimodal AI Treat Safety as Input

SingGuard treats safety rules as runtime inputs for multimodal AI, achieving SOTA across 6 families and 35 datasets via fast/slow reasoning.

Jun 30, 202685% relevant

Apple AFM Core Advanced: Sparse, Multimodal, iPhone 17 Pro Only

Apple AFM Core Advanced is sparse, multimodal, and exclusive to iPhone 17 Pro, M3+ Mac, M4+ iPad, while AFM Core is dense for other devices.

Jun 8, 202683% relevant

Google Gemma 4 12B: Encoder-Free Multimodal Model Launches

Google launched Gemma 4 12B, an encoder-free multimodal model for on-device AI, reducing latency by eliminating the vision encoder.

Jun 3, 2026100% relevant

MiniMax M3: Sparse Attention, 1M Context, Multimodal via Together

MiniMax M3 uses sparse attention for 1M context and multimodality, with Together AI serving fast inference.

Jun 3, 202695% relevant

ByteDance Open-Sources BAGEL: 7B Multimodal Model for Image Gen, Editing, Understanding

ByteDance open-sourced BAGEL, a 7B multimodal model for image gen, editing, style transfer, and understanding under Apache 2.0.

May 28, 202695% relevant

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

May 19, 202690% relevant

Odyssey Launches Starchild-1, First Real-Time Multimodal World Model

Odyssey AI released Starchild-1, first real-time multimodal world model for video generation targeting embodied AI and robotics.

May 18, 202695% relevant

DataArc-SynData-Toolkit: Open-Source Framework for Multimodal Synthetic Data

DataArc-SynData-Toolkit is an open-source framework for multimodal synthetic data, aiming to lower technical barriers for LLM training. It features a configuration-driven pipeline with visual interface and modular architecture.

May 12, 202670% relevant

Meta Tuna-2: Encoder-Free Multimodal Model Beats VAE-Based Rivals

Meta released Tuna-2, an encoder-free multimodal model that understands and generates images from raw pixels. It beats encoder-based models on fine-grained perception benchmarks, challenging the dominant VAE/vision encoder paradigm.

Apr 28, 202690% relevant

NVIDIA Nemotron 3 Nano Omni: Open Multimodal Model Unifies Video, Audio, Image, Text

NVIDIA announced Nemotron 3 Nano Omni, an open multimodal model that processes video, audio, images, and text in a unified architecture, expanding accessibility for multimodal AI research.

Apr 28, 202693% relevant

Tencent Open-Sources HY-World 2.0 Multimodal 3D World Model

Tencent's Hunyuan AI lab has open-sourced HY-World 2.0, a multimodal world model capable of generating, reconstructing, and simulating interactive 3D scenes. This release provides a significant, freely available tool for 3D content creation and embodied AI research.

Apr 17, 202685% relevant

Indexing Multimodal LLMs for Large-Scale Image Retrieval

A new arXiv paper proposes using Multimodal LLMs (MLLMs) for instance-level image-to-image retrieval. By prompting models with paired images and converting next-token probabilities into scores, the method enables training-free re-ranking. It shows superior robustness to clutter and occlusion compared to specialized models, though struggles with severe appearance changes.

Apr 16, 202672% relevant

MedGemma 1.5 Technical Report Released, Details Multimodal Medical AI

Google DeepMind has published the technical report for MedGemma 1.5, detailing the architecture and capabilities of its open-source, multimodal medical AI model. This follows the initial Med-PaLM 2 release and represents a significant step in making specialized medical AI more accessible.

Apr 9, 202685% relevant

JBM-Diff: A New Graph Diffusion Model for Denoising Multimodal Recommendations

A new arXiv paper introduces JBM-Diff, a conditional graph diffusion model designed to clean 'noise' from multimodal item features (like images/text) and user behavior data (like accidental clicks) in recommendation systems. It aims to improve ranking accuracy by ensuring only preference-relevant signals are used.

Apr 7, 202678% relevant

Building a Multimodal Product Similarity Engine for Fashion Retail

The source presents a practical guide to constructing a product similarity engine for fashion retail. It focuses on using multimodal embeddings from text and images to find similar items, a core capability for recommendations and search.

Apr 5, 202696% relevant

Google Releases Gemma 4 Family Under Apache 2.0, Featuring 2B to 31B Models with MoE and Multimodal Capabilities

Google has released the Gemma 4 family of open-weight models, derived from Gemini 3 technology. The four models, ranging from 2B to 31B parameters and including a Mixture-of-Experts variant, are available under a permissive Apache 2.0 license and feature multimodal processing.

Apr 2, 2026100% relevant

mmAnomaly: New Multi-Modal Framework Uses Conditional Latent Diffusion to Achieve 94% F1 Score for mmWave Anomaly Detection

Researchers introduced mmAnomaly, a multi-modal anomaly detection system that uses a conditional latent diffusion model to synthesize expected mmWave spectra from visual context, achieving up to a 94% F1 score for detecting concealed weapons and through-wall anomalies.

Apr 2, 202672% relevant

Uni-SafeBench Study: Unified Multimodal Models Show 30-50% Higher Safety Failure Rates Than Specialized Counterparts

Researchers introduced Uni-SafeBench, a benchmark showing that Unified Multimodal Large Models (UMLMs) suffer a significant safety degradation compared to specialized models, with open-source versions showing the highest failure rates.

Apr 2, 202676% relevant

Stop Shipping Demo-Perfect Multimodal Systems: A Call for Production-Ready AI

A technical article argues that flashy, demo-perfect multimodal AI systems fail in production. It advocates for 'failure slicing'—rigorously testing edge cases—to build robust pipelines that survive real-world use.

Mar 31, 202696% relevant

MMM4Rec: A New Multi-Modal Mamba Model for Faster, More Transferable Sequential Recommendations

Researchers propose MMM4Rec, a novel sequential recommendation framework using State Space Duality for efficient multi-modal learning. It claims 10x faster fine-tuning convergence and improved accuracy by dynamically prioritizing key visual/textual information over user interaction sequences.

Mar 30, 202690% relevant

Training-Free Polynomial Graph Filtering: A New Paradigm for Ultra-Fast Multimodal Recommendation

Researchers propose a training-free graph filtering method for multimodal recommendation that fuses text, image, and interaction data without neural network training. It achieves up to 22.25% higher accuracy and runs in under 10 seconds, dramatically reducing computational overhead.

Mar 25, 202680% relevant

Gastric-X: New 1.7K-Case Multimodal Benchmark Challenges VLMs on Realistic Gastric Cancer Diagnosis Workflow

Researchers introduce Gastric-X, a comprehensive multimodal benchmark with 1.7K gastric cancer cases including CT scans, endoscopy, lab data, and expert notes. It evaluates VLMs on five clinical tasks to test if they can correlate biochemical signals with tumor features like physicians do.

Mar 23, 202677% relevant

Multimodal RAG System for Chest X-Ray Reports Achieves 0.95 Recall@5, Reduces Hallucinations with Citation Constraints

Researchers developed a multimodal retrieval-augmented generation system for drafting radiology impressions that fuses image and text embeddings. The system achieves Recall@5 above 0.95 on clinically relevant findings and enforces citation coverage to prevent hallucinations.

Mar 23, 202699% relevant

VLM2Rec: A New Framework to Fix 'Modality Collapse' in Multimodal Recommendation Systems

New research proposes VLM2Rec, a method to prevent Vision-Language Models from ignoring one data type (like images or text) when fine-tuned for recommendations. This solves a key technical hurdle for building more accurate, robust sequential recommenders that truly understand multimodal products.

Mar 19, 202686% relevant

AMES: A Scalable, Backend-Agnostic Architecture for Multimodal Enterprise Search

Researchers propose AMES, a unified multimodal retrieval system using late interaction. It enables cross-modal search (text, image, video) within existing enterprise engines like Solr without major redesign, balancing speed and accuracy.

Mar 17, 202679% relevant

New Research Identifies Data Quality as Key Bottleneck in Multimodal Forecasting

A new arXiv paper introduces CAF-7M, a 7-million-sample dataset for context-aided forecasting. The research shows that poor context quality, not model architecture, has limited multimodal forecasting performance. This has implications for retail demand prediction that combines numerical data with text or image context.

Mar 16, 202670% relevant

Goal-Driven Data Optimization: Training Multimodal AI with 95% Less Data

Researchers introduce GDO, a framework that optimizes multimodal instruction tuning by selecting high-utility training samples. It achieves faster convergence and higher accuracy using 5-7% of the data typically required. This addresses compute inefficiency in training vision-language models.

Mar 16, 202671% relevant

Anchored Alignment: A New Framework to Prevent Positional Collapse in Multimodal Recommender Systems

A new arXiv paper proposes AnchorRec, a framework for multimodal recommender systems that uses indirect, anchor-based alignment to preserve modality-specific structures and prevent 'ID dominance,' improving recommendation coherence.

Mar 16, 202689% relevant

Algorithmic Bridging: How Multimodal LLMs Can Enhance Existing Recommendation Systems

A new approach called 'Algorithmic Bridging' proposes combining multimodal conversational LLMs with conventional recommendation systems to boost performance while reusing existing infrastructure. This hybrid method aims to leverage the natural language understanding of LLMs without requiring full system replacement.

Mar 15, 202695% relevant

Google Launches Gemini Embedding 2: A New Multimodal Foundation for AI Applications

Google has released Gemini Embedding 2, a second-generation multimodal embedding model designed to process text, images, and audio simultaneously. This technical advancement creates more unified AI representations, potentially improving search, recommendation, and personalization systems.

Mar 13, 202677% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety