VLM
30 articles about VLMs in AI news
mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon
mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and other new models, with contributions from 21 developers.
ByteDance's PersonaVLM Boosts MLLM Personalization by 22.4%, Beats GPT-4o
ByteDance researchers unveiled PersonaVLM, a framework that transforms multimodal LLMs into personalized assistants with memory. It improves baseline performance by 22.4% and surpasses GPT-4o by 5.2% on personalized benchmarks.
MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.
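As a quick illustration of what an OpenAI-compatible endpoint buys you, a local MLX-VLM server could be queried with the standard openai Python client; the port, model id, and image URL below are placeholder assumptions, not documented MLX-VLM defaults:

```python
# Minimal sketch: querying a local OpenAI-compatible endpoint with the
# standard openai client. base_url, model id, and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/gemma-4-it-4bit",  # hypothetical local model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```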
mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup
The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.
mlx-vlm v0.4.2 Adds SAM3, DOTS-MOCR Models and Critical Fixes for Vision-Language Inference on Apple Silicon
mlx-vlm v0.4.2 released with support for Meta's SAM3 segmentation model and DOTS-MOCR document OCR, plus fixes for Qwen3.5, LFM2-VL, and Magistral models. Enables efficient vision-language inference on Apple Silicon via MLX framework.
KitchenTwin: VLM-Guided Scale Recovery Fuses Global Point Clouds with Object Meshes for Metric Digital Twins
Researchers propose KitchenTwin, a scale-aware 3D fusion framework that registers object meshes with transformer-predicted global point clouds using VLM-guided geometric anchors. The method resolves fundamental coordinate mismatches to build metrically consistent digital twins for embodied AI, and releases an open-source dataset.
Gastric-X: New 1.7K-Case Multimodal Benchmark Challenges VLMs on Realistic Gastric Cancer Diagnosis Workflow
Researchers introduce Gastric-X, a comprehensive multimodal benchmark with 1.7K gastric cancer cases including CT scans, endoscopy, lab data, and expert notes. It evaluates VLMs on five clinical tasks to test if they can correlate biochemical signals with tumor features like physicians do.
VLM2Rec: A New Framework to Fix 'Modality Collapse' in Multimodal Recommendation Systems
New research proposes VLM2Rec, a method to prevent Vision-Language Models from ignoring one data type (like images or text) when fine-tuned for recommendations. This solves a key technical hurdle for building more accurate, robust sequential recommenders that truly understand multimodal products.
VLM4Rec: A New Approach to Multimodal Recommendation Using Vision-Language Models for Semantic Alignment
A new research paper proposes VLM4Rec, a framework that uses large vision-language models to convert product images into rich, semantic descriptions, then encodes them for recommendation. It argues semantic alignment matters more than complex feature fusion, showing consistent performance gains.
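A rough sketch of the caption-then-embed recipe the paper argues for, using common open checkpoints (a BLIP captioner and a MiniLM sentence encoder) that stand in for the paper's actual models:

```python
# Sketch of "image -> semantic description -> embedding" for recommendation.
# Model choices and the file path are illustrative, not the paper's setup.
from transformers import pipeline
from sentence_transformers import SentenceTransformer

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

caption = captioner("product.jpg")[0]["generated_text"]  # rich text description
item_vector = encoder.encode(caption)  # semantic item embedding for the recommender
```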
Embedding distance predicts VLM typographic attack success (r=-0.93)
A new study shows that the embedding distance between text rendered in an image and the harmful target prompt strongly predicts attack success rate (r = -0.71 to -0.93). The researchers introduce CWA-SSA optimization to recover readability and bypass safety alignment without model access.
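As a rough illustration of the signal being measured (not the paper's CWA-SSA pipeline), one can compute cosine distances between CLIP text embeddings and correlate them with observed success rates; the model choice and the toy records below are assumptions:

```python
# Sketch: cosine distance between two CLIP text embeddings, correlated with
# observed attack success rates. Placeholder data; not the paper's pipeline.
import torch
from scipy.stats import pearsonr
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def cosine_distance(a: str, b: str) -> float:
    inputs = processor(text=[a, b], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return 1.0 - (emb[0] @ emb[1]).item()

# (image text, target prompt, observed success rate) -- toy placeholder values
log = [("caption A", "target prompt", 0.9),
       ("caption B", "target prompt", 0.4),
       ("caption C", "target prompt", 0.1)]
dists = [cosine_distance(t, p) for t, p, _ in log]
r, _ = pearsonr(dists, [s for _, _, s in log])
print(f"Pearson r = {r:.2f}")
```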
Halsted VLM: A 650,000-Video Surgical Atlas and Platform for Temporal Procedure Mapping
Researchers introduce Halsted, a vision-language model trained on over 650,000 annotated surgical videos across eight specialties. It surpasses prior SOTA in mapping surgical activity and is deployed via a web platform for direct surgeon use.
ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%
Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model scored only 37%.
The Fine-Grained Vision Gap: Why VLMs Excel at Conversation But Fail at Classification
New research reveals vision-language models struggle with fine-grained visual classification despite excelling at complex reasoning tasks. The study identifies architectural and training factors creating this disconnect, with implications for AI development.
The Text-Crutch Conundrum: How VLMs' Spatial Reasoning Depends on Reading, Not Seeing
New research reveals vision-language models struggle with basic spatial tasks when visual elements lack text labels. Three leading models performed dramatically worse identifying filled squares versus text symbols in identical grid patterns, exposing fundamental limitations in their visual processing capabilities.
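To make the contrast concrete, here is a toy reconstruction of such a stimulus pair: the same grid pattern rendered once with filled squares and once with text symbols. Grid size and styling are assumptions, not the study's exact stimuli:

```python
# Render one grid pattern two ways: filled squares vs. text symbols.
# Illustrative only; the study's actual stimuli may differ.
from PIL import Image, ImageDraw

def draw_grid(pattern, use_text, cell=40):
    n = len(pattern)
    img = Image.new("RGB", (n * cell, n * cell), "white")
    d = ImageDraw.Draw(img)
    for r in range(n):
        for c in range(n):
            x, y = c * cell, r * cell
            d.rectangle([x, y, x + cell, y + cell], outline="black")
            if pattern[r][c]:
                if use_text:
                    d.text((x + cell // 3, y + cell // 4), "X", fill="black")
                else:
                    d.rectangle([x + 5, y + 5, x + cell - 5, y + cell - 5], fill="black")
    return img

pattern = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
draw_grid(pattern, use_text=False).save("squares.png")   # purely visual version
draw_grid(pattern, use_text=True).save("symbols.png")    # text-labeled version
```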
How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute Prediction
LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.
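A rough sketch of the late-fusion idea, with a small MLP head standing in for LeBonCoin's transformer; every dimension and name here is illustrative, not their production code:

```python
# Late fusion sketch: a precomputed image embedding is concatenated with
# hashed character n-gram text features and fed to a small classifier.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import HashingVectorizer

text_vec = HashingVectorizer(analyzer="char", ngram_range=(2, 4), n_features=4096)

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=4096, n_classes=50):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, img_emb, txt_feats):
        return self.head(torch.cat([img_emb, txt_feats], dim=-1))

txt = torch.tensor(text_vec.transform(["red leather sofa"]).toarray(),
                   dtype=torch.float32)
img = torch.randn(1, 512)       # stands in for a precomputed visual embedding
logits = LateFusionClassifier()(img, txt)
probs = logits.softmax(dim=-1)  # production calibration would need extra work
```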
Nemotron ColEmbed V2: NVIDIA's New SOTA Embedding Models for Visual Document Retrieval
NVIDIA researchers have released Nemotron ColEmbed V2, a family of three models (3B, 4B, 8B parameters) that set new state-of-the-art performance on the ViDoRe benchmark for visual document retrieval. The models use a 'late interaction' mechanism and are built on top of pre-trained VLMs like Qwen3-VL and NVIDIA's own Eagle 2. This matters because it directly addresses the challenge of retrieving information from visually rich documents like PDFs and slides within RAG systems.
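The 'late interaction' mechanism is the ColBERT-style MaxSim operator: each query token embedding is matched against its best document-patch embedding, and the maxima are summed. A minimal sketch with made-up shapes:

```python
# ColBERT-style MaxSim late interaction. Dimensions are illustrative,
# not Nemotron ColEmbed's actual configuration.
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); doc_emb: (num_doc_patches, dim)."""
    sim = query_emb @ doc_emb.T         # (q, d) token-to-patch similarities
    return sim.max(dim=1).values.sum()  # best match per query token, summed

q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(q, d))
```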
Efficient Fine-Tuning of Vision-Language Models with LoRA & Quantization
A technical guide details methods for fine-tuning large VLMs like GPT-4V and LLaVA using Low-Rank Adaptation (LoRA) and quantization. This reduces computational cost and memory footprint, making custom VLM training more accessible.
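A hedged sketch of that recipe with Hugging Face peft and bitsandbytes, applied to an open LLaVA checkpoint; the model id and target modules are common choices, not necessarily the guide's exact setup:

```python
# Sketch: 4-bit quantized base model plus LoRA adapters via peft/bitsandbytes.
# Checkpoint and target_modules are common choices, not the guide's setup.
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # language-model attention projections
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```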
Hybrid Self-evolving Structured Memory: A Breakthrough for GUI Agent Performance
Researchers propose HyMEM, a graph-based memory system for GUI agents that combines symbolic nodes with continuous embeddings. It enables multi-hop retrieval and self-evolution, boosting open-source VLMs to surpass closed-source models like GPT-4o on computer-use tasks.
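A toy sketch of the hybrid idea, storing a symbolic label plus an embedding on each node and mixing similarity search with graph-hop expansion; none of this is the HyMEM implementation:

```python
# Hybrid memory graph sketch: symbolic nodes carry embeddings; retrieval
# seeds on similarity, then expands neighbors (multi-hop). Illustrative only.
import networkx as nx
import numpy as np

g = nx.Graph()
g.add_node("settings_menu", emb=np.random.rand(64))
g.add_node("wifi_toggle", emb=np.random.rand(64))
g.add_edge("settings_menu", "wifi_toggle", action="click")

def retrieve(query_emb, hops=1):
    sims = {n: float(query_emb @ d["emb"]) for n, d in g.nodes(data=True)}
    seed = max(sims, key=sims.get)              # most similar node
    return list(nx.ego_graph(g, seed, radius=hops).nodes())  # multi-hop expand

print(retrieve(np.random.rand(64)))
```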
Beyond CLIP: How Pinterest's PinCLIP Model Solves Fashion's Cold-Start Problem
Pinterest's PinCLIP multimodal AI model enhances product discovery by 20% over standard VLMs. It addresses cold-start content with a 15% engagement uplift, offering luxury retailers a blueprint for visual search and recommendation engines.
Stanford-Harvard Paper: Autonomous AI Agents Form Cartels in Market Simulation
A joint Stanford-Harvard paper shows that autonomous AI agents spontaneously formed cartels in a simulated market, colluding to raise prices without any human instruction to do so.
AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in E-commerce
AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. It achieves state-of-the-art results on large-scale datasets.
Rethinking the Necessity of Adaptive Retrieval-Augmented Generation
Researchers propose AdaRankLLM, a framework that dynamically decides when to retrieve external data for LLMs. It reduces computational overhead while maintaining performance, shifting adaptive retrieval's role based on model strength.
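The core idea, gating retrieval on model confidence, can be sketched as below; the confidence proxy (mean token log-probability), the threshold, and the llm/retriever interfaces are hypothetical stand-ins, not AdaRankLLM's actual method:

```python
# Illustrative gate: retrieve only when the model seems unsure. The `llm`
# and `retriever` interfaces and the threshold tau are hypothetical.
def answer(question: str, llm, retriever, tau: float = -1.0) -> str:
    draft, token_logprobs = llm.generate_with_logprobs(question)
    confidence = sum(token_logprobs) / len(token_logprobs)
    if confidence >= tau:            # confident: answer from parametric memory
        return draft
    docs = "\n".join(retriever.search(question, k=5))  # fall back to RAG
    return llm.generate(f"Context:\n{docs}\n\nQuestion: {question}")
```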
Fei-Fei Li Explains Why 'Open the Top Drawer' Is a Hard AI Problem
AI pioneer Fei-Fei Li breaks down why a simple instruction like 'open the top drawer and watch out for the vase' represents a major unsolved challenge in robotics, requiring robust perception, commonsense reasoning, and efficient learning from sparse rewards.
Navox Agents: 8 Specialized Claude Code Agents with Human Checkpoints
Install the Navox Agents plugin to access eight specialized AI agents (Architect, UI/UX, Security, Full Stack, etc.) that work in parallel with human approval gates for complex Claude Code projects.
GeoAgentBench: New Dynamic Benchmark Tests LLM Agents on 117 GIS Tools
A new benchmark, GeoAgentBench, evaluates LLM-based GIS agents in a dynamic sandbox with 117 tools. It introduces a novel Plan-and-React agent architecture that outperforms existing frameworks in multi-step spatial tasks.
RAG-Anything: Multimodal RAG for Text, Images, Tables & Formulas
An open-source project, RAG-Anything, tackles a major flaw in most RAG systems by enabling them to process and connect information from text, images, tables, and formulas within documents.
New Research Proposes Authority-aware Generative Retrieval (AuthGR)
A new arXiv paper introduces an Authority-aware Generative Retriever (AuthGR) framework. It uses multimodal signals to score document trustworthiness and trains a model to prioritize authoritative sources. Large-scale online A/B tests on a commercial search platform report significant improvements in user engagement and reliability.
Developer Swaps Dash Cam Analysis for Gemma 4 & Falcon Perception
A developer announced they are replacing their entire dash cam video analysis system with Google's Gemma 4 and Falcon Perception models, signaling a practical shift towards newer, specialized multimodal models for real-time edge applications.
NVIDIA Ising AI OS Cuts Quantum Calibration from Days to Hours
NVIDIA launched Ising, an open-source AI model family that acts as an OS for quantum computers. It uses a vision language model to automate calibration and a 3D neural network for error correction, reducing calibration from days to hours.
AllenAI's WildDet3D Enables Promptable 3D Object Detection from Single Images
Allen Institute for AI (AllenAI) has open-sourced WildDet3D, a model for promptable 3D object detection from single RGB images. It predicts 3D bounding boxes using flexible prompts and can integrate optional depth data.