@Prince_Canuma released mlx-vlm v0.5.0, the largest update yet. The release adds continuous batching, speculative decoding, and distributed inference for Apple Silicon.
Key facts
- 21 total contributors, 18 new to the project
- Continuous batching server with KV cache quantization
- Distributed inference for Qwen3.5, Kimi K2.5, and K2.6
- Gemma 4 video support with multi-video input and an MTP drafter
- Prompt caching with warm-disk persistence
Continuous Batching and Inference Optimizations
mlx-vlm v0.5.0 introduces a continuous batching server with KV cache quantization [According to @Prince_Canuma]. The server batches incoming requests dynamically, admitting new requests as running ones finish, which reduces latency and improves throughput for VLM inference on Apple Silicon hardware. The release also adds MTP (Multi-Token Prediction) and DFlash speculative decoding, available in single, batch, and server modes. These optimizations accelerate token generation by drafting and verifying multiple tokens per step instead of producing one token at a time.
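To make the scheduling idea concrete, here is a minimal, illustrative sketch of a continuous-batching decode loop in plain Python. It is not mlx-vlm's actual API: Request, decode_step, and serve are hypothetical names, and decode_step stubs out the batched model forward pass a real server would run.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                      # prompt token ids (placeholder)
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

def decode_step(batch: list[Request]) -> list[int]:
    # Stub for one batched forward pass: a real server would run the model
    # here and return one sampled token per active request.
    return [1 for _ in batch]

def serve(incoming: deque, max_batch: int = 8, eos: int = 2) -> None:
    active: list[Request] = []
    while incoming or active:
        # Core idea of continuous batching: admit waiting requests whenever
        # slots free up, instead of waiting for a whole static batch to drain.
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())
        for req, tok in zip(active, decode_step(active)):
            req.generated.append(tok)
        # Retire requests that produced EOS or hit their token budget;
        # the freed slots are refilled on the next loop iteration.
        active = [r for r in active
                  if r.generated[-1] != eos and len(r.generated) < r.max_new_tokens]

queue = deque(Request(prompt=[0], max_new_tokens=n) for n in (2, 4, 8))
serve(queue)  # short requests exit early while longer ones keep decoding
```

The point of the sketch is the interleaving: because admission and retirement happen between decode steps, a short request never waits behind a long one, which is where the throughput and latency gains come from.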
Distributed Inference and New Model Support
Distributed inference now supports Qwen3.5, Kimi K2.5, and K2.6 models [per the release notes]. This lets users split large model inference across multiple Apple Silicon devices, reducing memory pressure and enabling larger context windows. Prompt caching with warm-disk persistence allows cached prompts to survive server restarts, reducing cold-start latency.
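As a rough illustration of warm-disk prompt caching, the sketch below keys serialized KV state by a hash of the prompt so it can be reloaded after a restart instead of re-running prefill. The cache directory, file format, and function names are assumptions for illustration; the release notes do not describe mlx-vlm's actual on-disk layout.

```python
import hashlib
import pickle
from pathlib import Path

# Hypothetical cache location; mlx-vlm's real path and format may differ.
CACHE_DIR = Path("~/.cache/vlm-prompt-cache").expanduser()

def cache_key(prompt: str) -> str:
    # Content-address the cache entry by the prompt text itself.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def save_kv(prompt: str, kv_state: object) -> None:
    # Persist the precomputed KV state so it survives a server restart.
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{cache_key(prompt)}.pkl").write_bytes(pickle.dumps(kv_state))

def load_kv(prompt: str):
    # On a warm start, reload the KV state instead of re-prefilling the prompt.
    path = CACHE_DIR / f"{cache_key(prompt)}.pkl"
    return pickle.loads(path.read_bytes()) if path.exists() else None
```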
New Models and Server Features
The release adds support for Gemma 4 video processing, including multi-video input and an MTP drafter [According to @googlegemma]. New models include Youtu-VL, Nemotron 3 Nano Omni, and SAM 3D Body. The server now supports json_schema response_format and a thinking mode flag for structured outputs and chain-of-thought reasoning.
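For a sense of what structured outputs look like from the client side, here is a hedged example request using a json_schema response_format in the style of OpenAI-compatible servers. The endpoint URL, port, model id, and exact payload fields are assumptions, not confirmed details of the mlx-vlm server.

```python
import requests  # third-party: pip install requests

payload = {
    "model": "mlx-community/example-vlm",  # placeholder model id
    "messages": [{"role": "user", "content": "Describe the image in JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "caption",
            "schema": {
                "type": "object",
                "properties": {"caption": {"type": "string"}},
                "required": ["caption"],
            },
        },
    },
}
# Endpoint and port are assumptions modeled on OpenAI-compatible servers.
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json())
```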
Unique Take
The biggest story here is not the model count but the speculative decoding and distributed inference infrastructure. Most open-source VLM projects focus on single-device inference. mlx-vlm's support for MTP and DFlash, combined with distributed execution across Apple Silicon clusters, positions it as a viable alternative to NVIDIA-based inference stacks for edge and on-device deployment. This is the first major open-source VLM project to offer these optimizations for Apple's unified memory architecture.
What to watch
Watch for benchmark comparisons of mlx-vlm v0.5.0 against NVIDIA-based inference stacks (vLLM, TensorRT-LLM) on throughput and latency, especially for distributed inference on multi-node Apple Silicon clusters. Also monitor adoption of the Gemma 4 video pipeline in production use cases.