mllm

30 articles about mllm in AI news

SVoT Boosts MLLM Spatial Reasoning by 65% via RL-Verified Visual Chains

SVoT uses RL to verify MLLM spatial reasoning states, achieving up to 65% accuracy gains on OOD tests across five domains including Pacman and Gather.

Jun 11, 202688% relevant

WorldBench: Top MLLM Scores 64% on Visually Diverse Benchmark

WorldBench, a new multimodal benchmark, tests 15 MLLMs on visually diverse images. Top model scores 64.0%, exposing fundamental gaps in visual understanding.

Jun 8, 202692% relevant

HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding

HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.

May 21, 202685% relevant

MLLM Raters Show Central Tendency Bias in Clinical Scoring

Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward scale midpoint despite prompt modifications.

May 19, 202670% relevant

VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time

Frontier MLLMs achieve only 26.5% accuracy on VAB, far below human 68.9%. Fine-tuning bridges the gap.

May 14, 202660% relevant

AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in

AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. Achieves SOTA on large-scale datasets.

Apr 23, 202684% relevant

Token Warping for MLLMs Outperforms Pixel Methods in View Synthesis

Researchers propose warping image tokens instead of pixels for multi-view reasoning in MLLMs. The zero-shot method is robust to depth noise and outperforms established baselines.

Apr 6, 202697% relevant

MOON3.0: A New Reasoning-Aware MLLM for Fine-Grained E-commerce Product Understanding

A new arXiv paper introduces MOON3.0, a multimodal large language model (MLLM) specifically architected for e-commerce. It uses a novel joint contrastive and reinforcement learning framework to explicitly model fine-grained product details from images and text, outperforming other models on a new benchmark, MBE3.0.

Apr 2, 202694% relevant

DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness

Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.

Mar 20, 202699% relevant

CRYSTAL Benchmark Reveals Universal Step-Disorder in MLLMs: No Model Preserves >60% of Reasoning Steps in Correct Order

Researchers introduce CRYSTAL, a 6,372-instance benchmark evaluating multimodal reasoning through verifiable steps. It reveals systematic failures in 20 tested MLLMs, including universal cherry-picking and disordered reasoning chains.

Mar 16, 202695% relevant

MLLMRec-R1: A New Framework for Efficient Multimodal Sequential Recommendation with LLMs

Researchers propose MLLMRec-R1, a framework that makes Group Relative Policy Optimization (GRPO) practical for multimodal sequential recommendation by addressing computational cost and reward inflation issues. This enables more explainable, reasoning-based recommendations.

Mar 9, 202690% relevant

ByteDance Seed's SpatialTree Redefines MLLM Spatial Reasoning at CVPR 2026

ByteDance Seed's SpatialTree achieves 79.8% on SEAL-Bench, 12.4 points above GPT-4V, using hierarchical spatial decomposition. Open-sourced at CVPR 2026.

Jun 22, 2026100% relevant

Zalando Introduces MLLM-Based Evaluation for Product Retrieval

Zalando presents a multimodal LLM-based evaluation for product retrieval, aiming to enhance search relevance in e-commerce. This matters as it could set a new standard for assessing AI in retail search.

Jun 21, 202692% relevant

Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks

Visual-Seeker achieves SOTA on five multimodal search benchmarks, surpassing proprietary models by actively harvesting visual evidence during search.

Jun 16, 202672% relevant

ByteDance's PersonaVLM Boosts MLLM Personalization by 22.4%, Beats GPT-4o

ByteDance researchers unveiled PersonaVLM, a framework that transforms multimodal LLMs into personalized assistants with memory. It improves baseline performance by 22.4% and surpasses GPT-4o by 5.2% on personalized benchmarks.

Apr 20, 202697% relevant

ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy

Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.

Mar 27, 202679% relevant

SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%

SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.

May 12, 202691% relevant

Indexing Multimodal LLMs for Large-Scale Image Retrieval

A new arXiv paper proposes using Multimodal LLMs (MLLMs) for instance-level image-to-image retrieval. By prompting models with paired images and converting next-token probabilities into scores, the method enables training-free re-ranking. It shows superior robustness to clutter and occlusion compared to specialized models, though struggles with severe appearance changes.

Apr 16, 202672% relevant

FORGE Benchmark Reveals Domain Knowledge

Researchers introduced FORGE, a multimodal dataset with 2D/3D data and fine-grained annotations for manufacturing. Evaluating 18 MLLMs revealed domain knowledge, not visual grounding, is the key bottleneck, with fine-tuning offering a clear path forward.

Apr 10, 202672% relevant

VHS: Latent Verifier Cuts Diffusion Model Verification Cost by 63.3%, Boosts GenEval by 2.7%

Researchers propose Verifier on Hidden States (VHS), a verifier operating directly on DiT generator features, eliminating costly pixel-space decoding. It reduces joint generation-and-verification time by 63.3% and improves GenEval performance by 2.7% versus MLLM verifiers.

Mar 25, 202695% relevant

Mirage Probes Paper Reveals Two Distinct VLM Failure Modes

Mirage Probes paper reveals VLMs have two distinct failure modes—textual biases and spurious images—requiring different mitigations. Text cleaning only fixes one; the other needs representational interventions.

Jun 15, 202690% relevant

MM-LLM Framework Boosts Recommendation AUC 0.35%, Online Metrics 0.02%

arXiv paper proposes LLaMA2-based MM-LLM framework for recommendation, achieving 0.35% AUC gain and 0.02% online lift at scale.

May 12, 202685% relevant

FashionStylist: New Expert-Annotated Dataset Aims to Unify Multimodal

A new arXiv preprint introduces FashionStylist, a dataset with professional fashion annotations for item grounding, outfit completion, and outfit evaluation. It aims to address the fragmentation in existing fashion AI benchmarks by providing expert-level reasoning data.

Apr 13, 202686% relevant

Benchmark Shadows Study: Data Alignment Limits LLM Generalization

A controlled study finds that data distribution, not just volume, dictates LLM capability. Benchmark-aligned training inflates scores but creates narrow, brittle models, while coverage-expanding data leads to more distributed parameter adaptation and better generalization.

Apr 10, 2026100% relevant

New Benchmark and Methods Target Few-Shot Text-to-Image Retrieval for Complex Queries

Researchers introduce FSIR-BD, a benchmark for few-shot text-to-image retrieval, and two optimization methods to improve performance on compositional and out-of-distribution queries. This addresses a key weakness in pre-trained vision-language models.

Mar 30, 202686% relevant

ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%

Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.

Mar 23, 202675% relevant

SPARROW: A New Method for Precise Object Tracking in Video AI Models

Researchers introduce SPARROW, a technique that improves how AI models track and identify objects in videos with greater spatial precision and temporal consistency. This addresses critical limitations in current video understanding systems.

Mar 16, 202684% relevant

New Benchmark Exposes Critical Weakness in Multimodal AI: Object Orientation

A new AI benchmark, DORI, reveals that state-of-the-art vision-language models perform near-randomly on object orientation tasks. This fundamental spatial reasoning gap has direct implications for retail applications like virtual try-on and visual search.

Mar 13, 202670% relevant

DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures

Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.

Mar 13, 202675% relevant

The Next Frontier for Self-Driving Cars: Teaching AI to Think Like a Human

A new survey argues that autonomous driving's biggest hurdle is no longer perception but a lack of robust reasoning. The integration of large language models offers a path forward but creates a critical tension between slow deliberation and split-second safety.

Mar 13, 202681% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety