Apple Silicon developers working with multimodal AI have a significant update to integrate. The mlx-vlm library, which enables running vision-language models (VLMs) on Apple's Metal Performance Shaders framework, has released version 0.4.4 with substantial performance optimizations and new model support.
What's New in v0.4.4
The update focuses on three key areas: new model integration, core performance improvements for Apple hardware, and critical bug fixes for existing models.
New Model: Falcon-Perception 300M
The release adds support for the Falcon-Perception 300M model from the Technology Innovation Institute (TII). This compact 300-million-parameter vision-language model represents TII's continued push into efficient multimodal AI, following its Falcon series of text-only models. The inclusion gives mlx-vlm users another option for on-device vision understanding tasks.
Performance Breakthrough: TurboQuant Metal Kernels
The headline technical improvement is the introduction of TurboQuant Metal kernels. According to the release notes, these optimized kernels deliver:
- Up to 1.90× decode speed improvement over the baseline implementation
- 89% KV cache savings on longer context sequences
These optimizations specifically target the decoding phase of inference, which is typically the bottleneck for interactive applications. The 89% KV cache reduction is particularly significant for memory-constrained Apple devices, allowing for longer context windows without hitting memory limits.
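To make the memory claim concrete, here is a back-of-envelope KV cache sizing calculation. The model configuration below (layer count, KV heads, head dimension) is a hypothetical 300M-class setup chosen for illustration, not mlx-vlm's internal numbers:

```python
# Back-of-envelope KV cache sizing for a hypothetical VLM configuration.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value):
    # K and V each store one vector per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical 300M-class model: 24 layers, 8 KV heads, head_dim 64, fp16 values.
baseline = kv_cache_bytes(24, 8, 64, context_len=32_768, bytes_per_value=2)
print(f"fp16 KV cache at 32K tokens: {baseline / 2**20:.0f} MiB")   # 1536 MiB

# An 89% saving leaves 11% of the baseline footprint.
reduced = baseline * (1 - 0.89)
print(f"after 89% savings: {reduced / 2**20:.0f} MiB")              # 169 MiB
```

Even with made-up dimensions, the shape of the result holds: shrinking the cache by an order of magnitude is the difference between a 32K context fitting comfortably in unified memory and crowding out the model weights.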
VisionFeatureCache: Multi-Turn Image Caching
A new VisionFeatureCache system addresses a common inefficiency in conversational VLMs. Previously, when users referenced the same image across multiple turns in a conversation, the model would re-encode the image each time—wasting computational resources and increasing latency. The new caching system stores image features after the first encoding, eliminating redundant processing for subsequent turns involving the same image.
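The idea can be sketched as a cache keyed by image content: hash the image bytes, encode on a miss, and reuse the stored features on a hit. This is an illustrative sketch only; the class name matches the release notes, but the internals and method names here are assumptions, not mlx-vlm's actual implementation:

```python
import hashlib

class VisionFeatureCache:
    """Illustrative sketch of image-feature caching keyed by content hash.
    The real mlx-vlm implementation may differ; method names are assumptions."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_features(self, image_bytes, encode_fn):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1           # same image seen before: skip re-encoding
        else:
            self.misses += 1         # first sighting: run the vision encoder once
            self._cache[key] = encode_fn(image_bytes)
        return self._cache[key]

# Demo with a stand-in "encoder" that just counts its invocations.
calls = []
def encode(image_bytes):
    calls.append(1)                  # track encoder invocations
    return len(image_bytes)          # stand-in for real image features

cache = VisionFeatureCache()
cache.get_features(b"same-image", encode)
cache.get_features(b"same-image", encode)   # second conversation turn: cache hit
print(len(calls))  # encoder ran only once
```

Keying on a content hash rather than a filename means the cache also works when the same image arrives through different paths in a conversation.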
Technical Fixes and Improvements
The release includes several important fixes for existing models:
Gemma 4 Enhancements
- Chunked prefill for KV-shared models: Improves memory management during the initial prompt processing phase
- Vision + text degradation fixes: Addresses quality issues when processing combined image and text inputs
- Processor configuration improvements: Better handling of model configuration parameters
- Nested tool parsing fixes: Corrects issues with complex tool-calling scenarios
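Chunked prefill, mentioned above, is a general technique worth unpacking: instead of pushing the entire prompt through the model in one forward pass, the prompt is processed in fixed-size chunks that each append to a shared KV cache, bounding peak activation memory. The sketch below shows the pattern generically; it is not mlx-vlm's actual code:

```python
# Generic sketch of chunked prefill: feed the prompt through the model in
# fixed-size chunks so peak activation memory stays bounded. Purely
# illustrative; mlx-vlm's real implementation is not shown here.

def chunked_prefill(tokens, chunk_size, forward_fn, kv_cache):
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # forward_fn runs one forward pass over the chunk and appends
        # its keys/values to the shared cache.
        forward_fn(chunk, kv_cache)
    return kv_cache

# Stand-in forward pass that just records the processed tokens.
cache = []
chunked_prefill(list(range(10)), chunk_size=4,
                forward_fn=lambda chunk, kv: kv.extend(chunk),
                kv_cache=cache)
print(len(cache))  # all 10 prompt tokens landed in the cache
```

For KV-shared models the payoff is that the shared cache grows incrementally rather than requiring the full-prompt activations to be materialized at once.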
Video CLI Fixes
The command-line interface for video processing has been stabilized with various bug fixes, improving reliability for video analysis workflows.
Getting Started
Developers can update immediately using:
uv pip install -U mlx-vlm
Or with pip:
pip install --upgrade mlx-vlm
The update maintains mlx-vlm's focus on bringing efficient vision-language model inference to Apple's ecosystem, competing with other edge-focused frameworks while leveraging Apple's Metal API for hardware acceleration.
gentic.news Analysis
This release represents a maturation point for the mlx-vlm project, which we first covered in December 2025 when it reached version 0.3.0 with initial Gemma 2B support. The addition of TurboQuant Metal kernels marks a significant engineering achievement—achieving 1.9× decode speedup through kernel-level optimizations shows the team is moving beyond simple framework wrapping to genuine performance innovation.
The timing aligns with Apple's increased focus on on-device AI, particularly following its acquisition of DarwinAI in early 2024 and the integration of more advanced neural engines in its M4 and upcoming M5 chips. mlx-vlm's optimizations directly complement Apple's hardware roadmap, which emphasizes energy-efficient inference for sustained AI workloads.
Falcon-Perception 300M's inclusion is noteworthy as it represents TII's first vision model in the mlx ecosystem. This follows TII's pattern of releasing efficient models (their Falcon-7B was notable for its performance-per-parameter ratio) and suggests they're targeting the edge inference market where Apple Silicon dominates. The 300M parameter size is particularly suited for mobile deployment, potentially enabling new categories of on-device visual assistants.
The VisionFeatureCache innovation addresses a real-world usability issue that has persisted in multimodal chat applications. By eliminating redundant image encoding, this could make multi-turn visual conversations significantly more responsive—a critical improvement for applications like coding assistants with screenshot context or document analysis tools.
Looking forward, the 89% KV cache savings for long contexts could enable more complex visual reasoning tasks on consumer hardware. As context windows continue to expand across the industry (with models now routinely supporting 128K+ tokens), efficient cache management becomes increasingly critical for practical deployment.
Frequently Asked Questions
What is mlx-vlm and who maintains it?
mlx-vlm is an open-source library that enables running vision-language models on Apple Silicon Macs using Apple's Metal Performance Shaders framework. It's maintained by independent developers (including @Prince_Canuma) and provides a Python interface similar to Hugging Face's transformers library but optimized for Apple hardware.
How does the 1.9× decode speed improvement affect real applications?
The decode speed improvement directly reduces latency when the model generates text responses. For interactive applications like chatbots with visual context, coding assistants that analyze screenshots, or document understanding tools, this means faster responses and smoother user experiences. The improvement is most noticeable in longer conversations where multiple decoding steps occur.
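As a rough illustration of what a 1.9× decode speedup means in wall-clock terms, consider the arithmetic below. The baseline throughput and response length are hypothetical numbers chosen only to make the ratio tangible:

```python
# Illustrative latency math for a 1.9x decode speedup.
# The baseline tokens/sec and response length are assumptions, not benchmarks.
baseline_tps = 40.0                  # hypothetical baseline decode throughput
improved_tps = baseline_tps * 1.9    # 76 tokens/sec with the new kernels
response_tokens = 380                # a long-ish multimodal answer

print(response_tokens / baseline_tps)   # 9.5 seconds before
print(response_tokens / improved_tps)   # 5.0 seconds after
```

The absolute numbers will differ per chip and model, but a near-halving of generation time is the kind of change users feel directly in interactive sessions.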
Can I use mlx-vlm with existing Hugging Face models?
mlx-vlm supports specific model architectures that have been ported to the mlx framework. While not all Hugging Face models are compatible, the library supports popular architectures like LLaVA, Gemma, and now Falcon-Perception. You typically need to convert weights from standard formats to mlx-compatible formats, which the library's documentation explains.
What Apple hardware is required for these optimizations?
The TurboQuant Metal kernels require Apple Silicon Macs (M1, M2, M3, or M4 series) with macOS Sonoma or later. The optimizations specifically leverage the Neural Engine and GPU capabilities through Metal Performance Shaders, so performance improvements will vary based on your specific chip generation and memory configuration.