@Prince_Canuma released mlx-vlm v0.5.0, the largest update yet. The release adds continuous batching, speculative decoding, and distributed inference for Apple Silicon.
Key facts
- 21 total contributors, 18 new to the project
- Continuous batching server with KV cache quantization
- Distributed inference for Qwen3.5, Kimi K2.5, and K2.6
- Gemma 4 video support with multi-video input and an MTP drafter
- Prompt caching with warm-disk persistence
Continuous Batching and Inference Optimizations
mlx-vlm v0.5.0 introduces a continuous batching server with KV cache quantization [According to @Prince_Canuma]. The server batches incoming requests dynamically, admitting new requests as running ones finish, which reduces latency and improves throughput for VLM inference on Apple Silicon hardware. The release also adds MTP (Multi-Token Prediction) and DFlash speculative decoding, available in single, batch, and server modes. These optimizations accelerate token generation by drafting and verifying multiple tokens per step instead of producing one token at a time.
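To make the scheduling idea concrete, here is a minimal, illustrative sketch of a continuous-batching decode loop in plain Python. It is not mlx-vlm's actual API: Request, decode_step, and serve are hypothetical names, and decode_step stubs out the batched model forward pass a real server would run.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                      # prompt token ids (placeholder)
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

def decode_step(batch: list[Request]) -> list[int]:
    # Stub for one batched forward pass: a real server would run the model
    # here and return one sampled token per active request.
    return [1 for _ in batch]

def serve(incoming: deque, max_batch: int = 8, eos: int = 2) -> None:
    active: list[Request] = []
    while incoming or active:
        # Core idea of continuous batching: admit waiting requests whenever
        # slots free up, instead of waiting for a whole static batch to drain.
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())
        for req, tok in zip(active, decode_step(active)):
            req.generated.append(tok)
        # Retire requests that produced EOS or hit their token budget;
        # the freed slots are refilled on the next loop iteration.
        active = [r for r in active
                  if r.generated[-1] != eos and len(r.generated) < r.max_new_tokens]

queue = deque(Request(prompt=[0], max_new_tokens=n) for n in (2, 4, 8))
serve(queue)  # short requests exit early while longer ones keep decoding
```

The point of the sketch is the interleaving: because admission and retirement happen between decode steps, a short request never waits behind a long one, which is where the throughput and latency gains come from.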
Distributed Inference and New Model Support
Distributed inference now supports Qwen3.5, Kimi K2.5, and K2.6 models [per the release notes]. This lets users split large model inference across multiple Apple Silicon devices, reducing memory pressure and enabling larger context windows. Prompt caching with warm-disk persistence allows cached prompts to survive server restarts, reducing cold-start latency.
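As a rough illustration of warm-disk prompt caching, the sketch below keys serialized KV state by a hash of the prompt so it can be reloaded after a restart instead of re-running prefill. The cache directory, file format, and function names are assumptions for illustration; the release notes do not describe mlx-vlm's actual on-disk layout.

```python
import hashlib
import pickle
from pathlib import Path

# Hypothetical cache location; mlx-vlm's real path and format may differ.
CACHE_DIR = Path("~/.cache/vlm-prompt-cache").expanduser()

def cache_key(prompt: str) -> str:
    # Content-address the cache entry by the prompt text itself.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def save_kv(prompt: str, kv_state: object) -> None:
    # Persist the precomputed KV state so it survives a server restart.
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{cache_key(prompt)}.pkl").write_bytes(pickle.dumps(kv_state))

def load_kv(prompt: str):
    # On a warm start, reload the KV state instead of re-prefilling the prompt.
    path = CACHE_DIR / f"{cache_key(prompt)}.pkl"
    return pickle.loads(path.read_bytes()) if path.exists() else None
```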
New Models and Server Features
The release adds support for Gemma 4 video processing, including multi-video input and an MTP drafter [According to @googlegemma]. New models include Youtu-VL, Nemotron 3 Nano Omni, and SAM 3D Body. The server now supports json_schema response_format and a thinking mode flag for structured outputs and chain-of-thought reasoning.
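For a sense of what structured outputs look like from the client side, here is a hedged example request using a json_schema response_format in the style of OpenAI-compatible servers. The endpoint URL, port, model id, and exact payload fields are assumptions, not confirmed details of the mlx-vlm server.

```python
import requests  # third-party: pip install requests

payload = {
    "model": "mlx-community/example-vlm",  # placeholder model id
    "messages": [{"role": "user", "content": "Describe the image in JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "caption",
            "schema": {
                "type": "object",
                "properties": {"caption": {"type": "string"}},
                "required": ["caption"],
            },
        },
    },
}
# Endpoint and port are assumptions modeled on OpenAI-compatible servers.
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json())
```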
Unique Take
The biggest story here is not the model count but the speculative decoding and distributed inference infrastructure. Most open-source VLM projects focus on single-device inference. mlx-vlm's support for MTP and DFlash, combined with distributed execution across Apple Silicon clusters, positions it as a viable alternative to NVIDIA-based inference stacks for edge and on-device deployment. This is the first major open-source VLM project to offer these optimizations for Apple's unified memory architecture.
What to watch
Watch for benchmark comparisons of mlx-vlm v0.5.0 against NVIDIA-based inference stacks (vLLM, TensorRT-LLM) on throughput and latency, especially for distributed inference on multi-node Apple Silicon clusters. Also monitor adoption of the Gemma 4 video pipeline in production use cases.