The MLX-VLM project, which enables running multimodal vision-language models locally on Apple Silicon, is preparing a significant server-side update. According to a preview from developer Prince Canuma, the next release will introduce continuous batching, an OpenAI-compatible API, multi-turn tool calling, and vision feature caching—all running entirely on-device.
What's New in the Release
The update focuses on server performance and developer experience for local multimodal AI inference. The key features include:
Continuous Batching: New inference requests can join an active batch immediately without waiting for the current batch to complete. This is particularly valuable for mixed workloads containing both image and text inputs, allowing the server to handle concurrent requests more efficiently.
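The scheduling idea can be sketched as a loop that admits waiting requests into free batch slots before every decode step, instead of only between batches. This is a toy illustration of the technique, not MLX-VLM's actual scheduler:

```python
from collections import deque

class ContinuousBatcher:
    """Toy scheduler: new requests join the active batch at any decode step."""

    def __init__(self, max_batch_size=4):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.active = []

    def submit(self, request):
        self.waiting.append(request)

    def step(self):
        # Admit waiting requests into free slots before each decode step.
        while self.waiting and len(self.active) < self.max_batch_size:
            self.active.append(self.waiting.popleft())
        # One decode step for every active request.
        finished = []
        for req in self.active:
            req["tokens_left"] -= 1
            if req["tokens_left"] == 0:
                finished.append(req)
        # Finished requests free their slots immediately.
        self.active = [r for r in self.active if r["tokens_left"] > 0]
        return finished


batcher = ContinuousBatcher(max_batch_size=2)
batcher.submit({"id": "a", "tokens_left": 1})
batcher.submit({"id": "b", "tokens_left": 3})
batcher.submit({"id": "c", "tokens_left": 1})  # queued: batch is full

done = batcher.step()           # "a" finishes, freeing a slot
done += batcher.step()          # "c" joined mid-flight and finishes
print([r["id"] for r in done])  # → ['a', 'c']
```

The key contrast with static batching: request "c" starts running as soon as "a" finishes, rather than waiting for "b" to complete the whole batch.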
OpenAI-Compatible API: The server will offer a field-for-field match with the mlx-lm API, including reasoning/content splitting for thinking models and tag-aware streaming. This compatibility makes it easier for developers to switch between cloud and local inference without changing their client code.
Multi-turn Tool Calling: Full tool use support across both streaming and non-streaming modes, compatible with Gemma4 and other model templates. This enables more complex agent-like workflows where models can call external tools across multiple conversation turns.
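In the OpenAI-style format, multi-turn tool use threads tool results back into the conversation as `"tool"`-role messages. A minimal transcript shape (the weather function and its arguments are purely illustrative):

```python
# A single tool-calling round trip in OpenAI-style message format.
messages = [
    {"role": "user", "content": "What's the weather in Lagos?"},
    # The model replies with a tool call instead of text.
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",           # illustrative tool
                    "arguments": '{"city": "Lagos"}',
                },
            }
        ],
    },
    # The client executes the tool and feeds the result back for the
    # next turn, linked by tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 31}'},
]

roles = [m["role"] for m in messages]
print(roles)  # → ['user', 'assistant', 'tool']
```

A multi-turn workflow simply repeats this pattern: the model may issue further tool calls in its next assistant turn, with the full message history carried forward each time.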
Vision Feature Caching: The most dramatic performance improvement comes from caching image embeddings across conversation turns. According to the announcement, this delivers a 228x speedup for Gemma4 and a 23x speedup for Qwen3.5 on cache hits.
Technical Details and Performance
All features run locally on Apple Silicon using the MLX framework developed by Apple's machine learning research team. The framework is optimized for Apple's M-series chips, leveraging their unified memory architecture so the CPU and GPU can share model weights without copying them between separate memory pools.
A demonstration shows the server handling 4 concurrent requests (with mixed image and text inputs) to Google's Gemma-4-26B-A4B-IT model in bf16 precision. Notably, one request processes an 8K resolution image—a demanding workload that highlights the efficiency gains from the new optimizations.
The vision feature caching is particularly significant for applications where the same images are referenced multiple times, such as in multi-turn conversations about visual content or when processing batches of similar images. By caching the computationally expensive image embeddings, subsequent references to the same image become dramatically faster.
How It Compares
| Feature | MLX-VLM (previous) | MLX-VLM (this release) | Cloud services |
|---|---|---|---|
| Batching | Static batching | Continuous batching | Variable implementations |
| API compatibility | Custom format | OpenAI-compatible | OpenAI, Anthropic, etc. |
| Vision processing | Full compute each time | 228x cache speedup | No local caching |
| Hardware requirement | Apple Silicon | Apple Silicon | Any with internet |
| Latency | Device-dependent | Significantly reduced | Network-dependent |

What This Means for Developers
The update makes local multimodal AI more practical for production use cases. Continuous batching improves server throughput, making it feasible to serve multiple users or applications from a single local instance. The OpenAI API compatibility reduces integration friction, allowing developers to use existing client libraries and tooling.
For applications involving repeated analysis of the same images—such as document processing, medical imaging, or creative workflows—the vision caching could transform performance characteristics. A 228x speedup on cache hits means near-instantaneous responses after the initial image processing.
Limitations and Considerations
While the performance improvements are substantial, they're specific to Apple Silicon hardware. Developers on other platforms won't benefit from these MLX-specific optimizations. Additionally, the caching benefits depend on workload patterns—applications with highly varied images won't see the same dramatic improvements as those with repeated image references.
The local nature of the system also means developers are responsible for model management, security, and scaling, unlike with cloud services that handle these concerns automatically.
gentic.news Analysis
This update represents a significant maturation of the local AI inference ecosystem on Apple hardware. MLX-VLM builds upon the foundation established by Apple's MLX framework, which we covered in December 2024 when it first enabled efficient large language model inference on Apple Silicon. The addition of continuous batching and vision caching addresses two critical bottlenecks for production deployment: throughput and computational efficiency for vision tasks.
The timing aligns with broader industry trends toward more efficient inference. As we reported in February 2026, both Google and Meta have released research on attention pattern optimization and speculative decoding that similarly aim to improve throughput. MLX-VLM's approach is distinctive in its hardware-specific optimizations for Apple's architecture, creating a competitive advantage for Mac-based AI applications.
The OpenAI API compatibility is particularly strategic. By adopting the de facto standard API format, MLX-VLM lowers switching costs for developers who might otherwise use cloud services. This mirrors the approach taken by other local inference solutions like Ollama and LM Studio, but with the added advantage of Apple hardware optimization.
Looking forward, the vision caching innovation could influence cloud providers as well. If 228x speedups are achievable through intelligent caching of embeddings, similar techniques might appear in cloud services for multimodal models, potentially reducing costs for repetitive visual analysis tasks.
Frequently Asked Questions
What is MLX-VLM?
MLX-VLM is an open-source project that enables running vision-language models (multimodal AI that processes both images and text) locally on Apple Silicon Macs. It's built on Apple's MLX framework, which is optimized for Apple's M-series chips.
How does vision feature caching work?
When a model processes an image, it first converts it into numerical embeddings (vector representations). Vision feature caching stores these embeddings so that if the same image is processed again—either in the same conversation or a different one—the system can skip the expensive embedding computation and use the cached version instead, resulting in dramatic speedups.
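The mechanism can be illustrated with a cache keyed by a hash of the image bytes. This is a simplified sketch; the announcement does not describe MLX-VLM's actual cache key or eviction policy:

```python
import hashlib

def _embed(image_bytes):
    # Stand-in for the expensive vision-encoder forward pass.
    return [b / 255 for b in image_bytes[:8]]

class VisionFeatureCache:
    """Memoize image embeddings across conversation turns."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_features(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1   # cache hit: skip the encoder entirely
        else:
            self.misses += 1
            self._cache[key] = _embed(image_bytes)
        return self._cache[key]


cache = VisionFeatureCache()
img = bytes(range(16))
cache.get_features(img)   # first turn: full encode
cache.get_features(img)   # later turn: served from cache
print(cache.hits, cache.misses)  # → 1 1
```

Since the embedding step dominates the cost of reprocessing an image, a hit reduces that step to a dictionary lookup, which is where the reported speedups come from.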
Can I use MLX-VLM with any multimodal model?
MLX-VLM supports various model architectures through templates. The announcement specifically mentions compatibility with Google's Gemma4 and Alibaba's Qwen3.5 models, but the framework likely supports other popular vision-language models that can be converted to the MLX format.
What hardware do I need to run MLX-VLM?
You need a Mac with Apple Silicon (M1, M2, M3, or later chips). Performance will vary based on your specific chip, memory configuration, and model size. The demonstration mentioned in the announcement was run on an M3 Ultra, Apple's highest-end desktop chip.