[Image: Apple MacBook Pro running MLX-VLM inference, with code editor and terminal windows displaying model output]

mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon

mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, and Gemma 4 video, plus several new models, with contributions from 21 developers.

What's new in mlx-vlm v0.5.0?

mlx-vlm v0.5.0 adds continuous batching, KV cache quantization, MTP/DFlash speculative decoding, distributed inference for Qwen3.5 and Kimi K2.5, and Gemma 4 video support. Released by @Prince_Canuma with 21 contributors.

TL;DR

Continuous batching server with KV cache quantization · Distributed inference for Qwen3.5, Kimi K2.5 · Gemma 4 video support with MTP drafter

@Prince_Canuma released mlx-vlm v0.5.0, the largest update yet. The release adds continuous batching, speculative decoding, and distributed inference for Apple Silicon.
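
For readers new to the project, single-request inference follows a simple load-format-generate pattern. The sketch below is based on the project's documented usage; exact signatures have shifted between releases, so verify against the v0.5.0 README:

```python
# Minimal mlx-vlm usage sketch; based on the project's documented pattern,
# but check the v0.5.0 README for exact signatures.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # any supported checkpoint
model, processor = load(model_path)
config = load_config(model_path)

# Format a chat prompt for one image, then run inference on it.
images = ["photo.jpg"]
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=len(images))
output = generate(model, processor, prompt, images, verbose=False)
print(output)
```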

Key facts

  • 21 total contributors, 18 new to the project
  • Continuous batching server with KV cache quantization
  • Distributed inference for Qwen3.5, Kimi K2.5, and K2.6
  • Gemma 4 video support with multi-video input and an MTP drafter
  • Prompt caching with warm-disk persistence

Continuous Batching and Inference Optimizations

mlx-vlm v0.5.0 introduces a continuous batching server with KV cache quantization [According to @Prince_Canuma]. This allows dynamic batching of incoming requests, reducing latency and improving throughput for VLM inference on Apple Silicon hardware. The release also adds MTP (Multi-Token Prediction) and DFlash speculative decoding, available in single, batch, and server modes. These optimizations can accelerate token generation by predicting multiple tokens per step.
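
The article does not detail how mlx-vlm implements MTP or DFlash, but the underlying speculative-decoding principle is straightforward: a cheap drafter proposes several tokens, and the full model verifies them in a single forward pass. Here is a conceptual sketch of greedy acceptance; the `draft_model` and `target_model` callables are hypothetical stand-ins, not mlx-vlm APIs:

```python
# Conceptual sketch of greedy speculative decoding; draft_model and
# target_model are hypothetical stand-ins, not mlx-vlm APIs.
def speculative_step(target_model, draft_model, tokens, k=4):
    """Propose k draft tokens, verify them with one target-model pass,
    and keep the longest prefix the target model agrees with."""
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        t = draft_model(draft)        # cheap next-token prediction
        proposed.append(t)
        draft.append(t)

    # One target-model pass scores every proposed position at once.
    verified = target_model(tokens, proposed)  # target's choice per position

    accepted = []
    for p, v in zip(proposed, verified):
        accepted.append(v)            # target's token is always valid output
        if p != v:                    # first mismatch: stop accepting drafts
            break
    return tokens + accepted          # 1..k tokens per expensive target pass
```

When the drafter agrees with the target model often, each expensive forward pass yields several tokens instead of one, which is where the speedup comes from.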

Distributed Inference and New Model Support

Distributed inference now supports Qwen3.5, Kimi K2.5, and K2.6 models [per the release notes]. This enables users to split large model inference across multiple Apple Silicon devices, reducing memory pressure and enabling larger context windows. Prompt caching with warm-disk persistence allows cached prompts to survive server restarts, improving cold-start latency.
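
Multi-device setups like this build on MLX's distributed communication layer. The sketch below shows the initialization step; `mx.distributed.init()` is real MLX API, but the sharded-load call is a hypothetical placeholder for whatever entry point mlx-vlm v0.5.0 actually exposes:

```python
# Joining an MLX distributed group. mx.distributed.init() is real MLX API;
# multi-node runs are typically launched with `mlx.launch` or mpirun.
import mlx.core as mx

group = mx.distributed.init()
print(f"rank {group.rank()} of {group.size()}")

# Hypothetical placeholder: each rank would load only its shard of the
# weights, so a model too large for one machine's unified memory can run
# across the cluster. mlx-vlm v0.5.0's actual entry point may differ.
# model, processor = load_sharded("mlx-community/Kimi-K2.5", group=group)
```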

New Models and Server Features

The release adds support for Gemma 4 video processing, including multi-video input and an MTP drafter [According to @googlegemma]. New models include Youtu-VL, Nemotron 3 Nano Omni, and SAM 3D Body. The server now supports json_schema response_format and a thinking mode flag for structured outputs and chain-of-thought reasoning.
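
Structured output via json_schema response_format generally follows the OpenAI convention. A hedged client sketch is below; the route, port, and model id are assumptions rather than confirmed v0.5.0 API:

```python
# Hedged client sketch for structured output; assumes an OpenAI-compatible
# /v1/chat/completions route on a local mlx-vlm server. The URL, port, and
# model id are assumptions, not confirmed v0.5.0 API.
import requests

schema = {
    "name": "image_report",
    "schema": {
        "type": "object",
        "properties": {
            "caption": {"type": "string"},
            "objects": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["caption", "objects"],
    },
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed endpoint
    json={
        "model": "mlx-community/some-vlm",         # hypothetical model id
        "messages": [{"role": "user", "content": "List the objects in the image."}],
        "response_format": {"type": "json_schema", "json_schema": schema},
    },
    timeout=120,
)
print(resp.json())
```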

Unique Take

The biggest story here is not the model count but the speculative decoding and distributed inference infrastructure. Most open-source VLM projects focus on single-device inference. mlx-vlm's support for MTP and DFlash, combined with distributed execution across Apple Silicon clusters, positions it as a viable alternative to NVIDIA-based inference stacks for edge and on-device deployment. This is the first major open-source VLM project to offer these optimizations for Apple's unified memory architecture.

What to watch

Watch for benchmark comparisons of mlx-vlm v0.5.0 against NVIDIA-based inference stacks (vLLM, TensorRT-LLM) on throughput and latency, especially for distributed inference on multi-node Apple Silicon clusters. Also monitor adoption of the Gemma 4 video pipeline in production use cases.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The mlx-vlm v0.5.0 release is a significant step for on-device VLM inference. While most open-source VLM projects focus on single-device inference on NVIDIA hardware, mlx-vlm is building out the infrastructure for Apple Silicon clusters. The inclusion of MTP and DFlash speculative decoding is particularly noteworthy, as these techniques are still rare in open-source inference engines. The distributed inference support for Qwen3.5 and Kimi models suggests a focus on large-context, multi-device workflows that are increasingly relevant for enterprise applications.

However, the release lacks benchmark numbers. The community will need to validate whether the continuous batching and speculative decoding optimizations translate to real-world throughput gains. The prompt caching with warm-disk persistence is a practical feature that addresses a common pain point in production deployments: cold-start latency after server restarts.

The addition of Gemma 4 video support with multi-video input and an MTP drafter positions mlx-vlm as a strong candidate for video understanding tasks on Apple hardware. This is a domain where NVIDIA-based solutions currently dominate, but Apple's unified memory architecture offers advantages for large context windows. The project's 18 new contributors indicate growing community interest, but the lack of a formal benchmark suite makes it difficult to compare against vLLM or TensorRT-LLM.
