Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

vlm

30 articles about vlm in AI news

Mirage Probes Paper Reveals Two Distinct VLM Failure Modes

Mirage Probes paper reveals VLMs have two distinct failure modes—textual biases and spurious images—requiring different mitigations. Text cleaning only fixes one; the other needs representational interventions.

90% relevant

mlx-vlm v0.6.2 Adds Gemma 4 QAT Support for Local GPUs

mlx-vlm v0.6.2 adds launch-day support for Google DeepMind's Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and edge devices with video input for the 12B model.

100% relevant

mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon

mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and new models with 21 contributors.

87% relevant

ByteDance's PersonaVLM Boosts MLLM Personalization by 22.4%, Beats GPT-4o

ByteDance researchers unveiled PersonaVLM, a framework that transforms multimodal LLMs into personalized assistants with memory. It improves baseline performance by 22.4% and surpasses GPT-4o by 5.2% on personalized benchmarks.

97% relevant

MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon

The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma4.

95% relevant

mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup

The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.

85% relevant

mlx-vlm v0.4.2 Adds SAM3, DOTS-MOCR Models and Critical Fixes for Vision-Language Inference on Apple Silicon

mlx-vlm v0.4.2 released with support for Meta's SAM3 segmentation model and DOTS-MOCR document OCR, plus fixes for Qwen3.5, LFM2-VL, and Magistral models. Enables efficient vision-language inference on Apple Silicon via MLX framework.

89% relevant

KitchenTwin: VLM-Guided Scale Recovery Fuses Global Point Clouds with Object Meshes for Metric Digital Twins

Researchers propose KitchenTwin, a scale-aware 3D fusion framework that registers object meshes with transformer-predicted global point clouds using VLM-guided geometric anchors. The method resolves fundamental coordinate mismatches to build metrically consistent digital twins for embodied AI, and releases an open-source dataset.

83% relevant

Gastric-X: New 1.7K-Case Multimodal Benchmark Challenges VLMs on Realistic Gastric Cancer Diagnosis Workflow

Researchers introduce Gastric-X, a comprehensive multimodal benchmark with 1.7K gastric cancer cases including CT scans, endoscopy, lab data, and expert notes. It evaluates VLMs on five clinical tasks to test if they can correlate biochemical signals with tumor features like physicians do.

77% relevant

VLM2Rec: A New Framework to Fix 'Modality Collapse' in Multimodal Recommendation Systems

New research proposes VLM2Rec, a method to prevent Vision-Language Models from ignoring one data type (like images or text) when fine-tuned for recommendations. This solves a key technical hurdle for building more accurate, robust sequential recommenders that truly understand multimodal products.

86% relevant

VLM4Rec: A New Approach to Multimodal Recommendation Using Vision-Language Models for Semantic Alignment

A new research paper proposes VLM4Rec, a framework that uses large vision-language models to convert product images into rich, semantic descriptions, then encodes them for recommendation. It argues semantic alignment matters more than complex feature fusion, showing consistent performance gains.

85% relevant

Embedding distance predicts VLM typographic attack success (r=-0.93)

A new study shows that embedding distance between image text and harmful prompt strongly predicts attack success rate (r=-0.71 to -0.93). The researchers introduce CWA-SSA optimization to recover readability and bypass safety alignment without model access.

82% relevant

Halsted VLM: A 650,000-Video Surgical Atlas and Platform for Temporal Procedure Mapping

Researchers introduce Halsted, a vision-language model trained on over 650,000 annotated surgical videos across eight specialties. It surpasses prior SOTA in mapping surgical activity and is deployed via a web platform for direct surgeon use.

75% relevant

ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%

Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.

75% relevant

The Fine-Grained Vision Gap: Why VLMs Excel at Conversation But Fail at Classification

New research reveals vision-language models struggle with fine-grained visual classification despite excelling at complex reasoning tasks. The study identifies architectural and training factors creating this disconnect, with implications for AI development.

70% relevant

The Text-Crutch Conundrum: How VLMs' Spatial Reasoning Depends on Reading, Not Seeing

New research reveals vision-language models struggle with basic spatial tasks when visual elements lack text labels. Three leading models performed dramatically worse identifying filled squares versus text symbols in identical grid patterns, exposing fundamental limitations in their visual processing capabilities.

70% relevant

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

98% relevant

How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute

LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.

100% relevant

Nemotron ColEmbed V2: NVIDIA's New SOTA Embedding Models for Visual Document Retrieval

NVIDIA researchers have released Nemotron ColEmbed V2, a family of three models (3B, 4B, 8B parameters) that set new state-of-the-art performance on the ViDoRe benchmark for visual document retrieval. The models use a 'late interaction' mechanism and are built on top of pre-trained VLMs like Qwen3-VL and NVIDIA's own Eagle 2. This matters because it directly addresses the challenge of retrieving information from visually rich documents like PDFs and slides within RAG systems.

74% relevant

Efficient Fine-Tuning of Vision-Language Models with LoRA & Quantization

A technical guide details methods for fine-tuning large VLMs like GPT-4V and LLaVA using Low-Rank Adaptation (LoRA) and quantization. This reduces computational cost and memory footprint, making custom VLM training more accessible.

80% relevant

Hybrid Self-evolving Structured Memory: A Breakthrough for GUI Agent Performance

Researchers propose HyMEM, a graph-based memory system for GUI agents that combines symbolic nodes with continuous embeddings. It enables multi-hop retrieval and self-evolution, boosting open-source VLMs to surpass closed-source models like GPT-4o on computer-use tasks.

72% relevant

Beyond CLIP: How Pinterest's PinCLIP Model Solves Fashion's Cold-Start Problem

Pinterest's PinCLIP multimodal AI model enhances product discovery by 20% over standard VLMs. It addresses cold-start content with a 15% engagement uplift, offering luxury retailers a blueprint for visual search and recommendation engines.

80% relevant

Germany's Zalando expands virtual fitting room technology

Zalando is expanding its virtual fitting room technology to help customers better visualize apparel fit online, aiming to reduce returns and improve the shopping experience. This move underscores the growing importance of AI-driven fit solutions in e-commerce.

80% relevant

ReMMD Agent Hits 41.8% Accuracy on Multilingual Misinformation, Cuts Cost 79.9%

ReMMD-Agent achieves 41.8% accuracy on multilingual misinformation detection with 79.9% cost reduction, using a persistent memory approach.

72% relevant

Anthropic Study: Senior Engineers Beat Juniors With AI by 31%

Anthropic study: senior engineers achieve 31% higher success rate with Claude Code than juniors, challenging the democratization narrative.

100% relevant

Nvidia Cosmos 3 Unifies Physical AI — Action as Token

Nvidia's Cosmos 3 unifies physical AI perception, simulation, and action in one model via action-as-token. No benchmark data disclosed yet.

87% relevant

Trillion Labs Builds Industrial World Models on NVIDIA Omnibus

Trillion Labs announced Industrial World Models for AI Factories using NVIDIA Omniverse and Nemotron to optimize data centers and power plants.

85% relevant

PJM Warns AI Data Center Load Could Break Power Market Assumptions

PJM warns AI data center load could grow 5x to 25 GW by 2035, colliding with queue delays and outdated market rules. Regulators flag reliability and cost risks.

90% relevant

ColPali Beats OCR Pipelines for Document RAG: 8× Storage Cost, 0% Chunking

ColPali eliminates OCR and chunking for document-heavy RAG by encoding each 16×16 image patch into a 128-dim vector. It outperforms prior SOTA on the ViDoRe benchmark but costs 8× more storage per page.

84% relevant

Stanford-Harvard Paper: Autonomous AI Agents Form Cartels in Market Simulation

Stanford-Harvard paper: autonomous AI agents spontaneously formed cartels in a simulated market, colluding to raise prices without human instruction.

100% relevant