
mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup


The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.

Gala Smith & AI Research Desk · 4h ago · 5 min read · AI-Generated

Apple Silicon developers working with multimodal AI have a significant update to integrate. The mlx-vlm library, which runs vision-language models (VLMs) on Apple hardware via Apple's MLX framework and its Metal GPU backend, has released version 0.4.4 with substantial performance optimizations and new model support.

What's New in v0.4.4

The update focuses on three key areas: new model integration, core performance improvements for Apple hardware, and critical bug fixes for existing models.

New Model: Falcon-Perception 300M
The release adds support for the Falcon-Perception 300M model from the Technology Innovation Institute (TII). This compact 300-million parameter vision-language model represents TII's continued push into efficient multimodal AI following their Falcon series of text-only models. The inclusion provides mlx-vlm users with another option for on-device vision understanding tasks.

Performance Breakthrough: TurboQuant Metal Kernels
The headline technical improvement is the introduction of TurboQuant Metal kernels. According to the release notes, these optimized kernels deliver:

  • Up to 1.90× decode speed improvement over the baseline implementation
  • 89% KV cache savings on longer context sequences

These optimizations specifically target the decoding phase of inference, which is typically the bottleneck for interactive applications. The 89% KV cache reduction is particularly significant for memory-constrained Apple devices, allowing for longer context windows without hitting memory limits.
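
To put the cache-savings figure in context, here is a back-of-the-envelope calculation with made-up model dimensions (these numbers are illustrative, not from the release notes):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    # Keys and values are each stored per layer: 2 tensors of
    # shape (num_kv_heads, seq_len, head_dim).
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical small VLM: 24 layers, 8 KV heads, head dim 128,
# an 8192-token context, fp16 values (2 bytes each).
baseline = kv_cache_bytes(24, 8, 128, 8192, 2)
reduced = baseline * (1 - 0.89)  # the 89% savings claimed in the release

print(f"baseline: {baseline / 1024**2:.0f} MiB")  # 768 MiB
print(f"reduced:  {reduced / 1024**2:.0f} MiB")   # ~84 MiB
```

At these (hypothetical) dimensions, the cache drops from 768 MiB to roughly 84 MiB, which is why the savings matter so much on machines where the GPU shares unified memory with everything else.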

VisionFeatureCache: Multi-Turn Image Caching
A new VisionFeatureCache system addresses a common inefficiency in conversational VLMs. Previously, when users referenced the same image across multiple turns in a conversation, the model would re-encode the image each time—wasting computational resources and increasing latency. The new caching system stores image features after the first encoding, eliminating redundant processing for subsequent turns involving the same image.
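
The pattern can be sketched in a few lines. This is an illustrative mock of the idea, not mlx-vlm's actual VisionFeatureCache API (class and method names here are invented):

```python
import hashlib

class ImageFeatureCache:
    """Cache vision-encoder outputs keyed by a hash of the image bytes."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_features(self, image_bytes, encode_fn):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = encode_fn(image_bytes)  # expensive encode, run once
        else:
            self.hits += 1
        return self._cache[key]

# Simulated multi-turn conversation referencing the same image:
calls = []
def fake_encoder(data):
    calls.append(data)   # stands in for the vision tower
    return [len(data)]   # placeholder "features"

cache = ImageFeatureCache()
img = b"...same screenshot bytes..."
for _ in range(3):       # user refers to the image across 3 turns
    cache.get_features(img, fake_encoder)

print(len(calls), cache.hits)  # encoder ran once; 2 cache hits
```

The design choice worth noting: keying on the image content (a hash) rather than a conversation-turn index means the cache also helps when the same image reappears in unrelated requests.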

Technical Fixes and Improvements

The release includes several important fixes for existing models:

Gemma 4 Enhancements

  • Chunked prefill for KV-shared models: Improves memory management during the initial prompt processing phase
  • Vision + text degradation fixes: Addresses quality issues when processing combined image and text inputs
  • Processor configuration improvements: Better handling of model configuration parameters
  • Nested tool parsing fixes: Corrects issues with complex tool-calling scenarios
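
For readers unfamiliar with chunked prefill: the idea is to feed the prompt through the model in fixed-size slices, so peak activation memory is bounded by the chunk size rather than the prompt length, while the KV cache still accumulates across the whole prompt. A generic sketch of the technique (not mlx-vlm's implementation):

```python
def chunked_prefill(token_ids, process_chunk, chunk_size=512):
    """Run prompt processing in fixed-size chunks.

    process_chunk stands in for one forward pass that appends the
    chunk's keys/values to the KV cache.
    """
    for start in range(0, len(token_ids), chunk_size):
        process_chunk(token_ids[start:start + chunk_size])

# Example: a 1300-token prompt is prefilled in three passes.
seen = []
chunked_prefill(list(range(1300)), lambda chunk: seen.append(len(chunk)))
print(seen)  # [512, 512, 276]
```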

Video CLI Fixes
The command-line interface for video processing has been stabilized with various bug fixes, improving reliability for video analysis workflows.

Getting Started

Developers can update immediately using:

uv pip install -U mlx-vlm

Or with pip:

pip install --upgrade mlx-vlm

The update maintains mlx-vlm's focus on bringing efficient vision-language model inference to Apple's ecosystem, competing with other edge-focused frameworks while leveraging Apple's Metal API for hardware acceleration.

gentic.news Analysis

This release represents a maturation point for the mlx-vlm project, which we first covered in December 2025 when it reached version 0.3.0 with initial Gemma 2B support. The addition of TurboQuant Metal kernels marks a significant engineering achievement—achieving 1.9× decode speedup through kernel-level optimizations shows the team is moving beyond simple framework wrapping to genuine performance innovation.

The timing aligns with Apple's increased focus on on-device AI, particularly following their acquisition of DarwinAI in early 2025 and the integration of more advanced neural engines in their M4 and upcoming M5 chips. mlx-vlm's optimizations directly complement Apple's hardware roadmap, which emphasizes energy-efficient inference for sustained AI workloads.

Falcon-Perception 300M's inclusion is noteworthy as it represents TII's first vision model in the mlx ecosystem. This follows TII's pattern of releasing efficient models (their Falcon-7B was notable for its performance-per-parameter ratio) and suggests they're targeting the edge inference market where Apple Silicon dominates. The 300M parameter size is particularly suited for mobile deployment, potentially enabling new categories of on-device visual assistants.

The VisionFeatureCache innovation addresses a real-world usability issue that has persisted in multimodal chat applications. By eliminating redundant image encoding, this could make multi-turn visual conversations significantly more responsive—a critical improvement for applications like coding assistants with screenshot context or document analysis tools.

Looking forward, the 89% KV cache savings for long contexts could enable more complex visual reasoning tasks on consumer hardware. As context windows continue to expand across the industry (with models now routinely supporting 128K+ tokens), efficient cache management becomes increasingly critical for practical deployment.

Frequently Asked Questions

What is mlx-vlm and who maintains it?

mlx-vlm is an open-source library that runs vision-language models on Apple Silicon Macs via Apple's MLX array framework, which uses Metal for GPU acceleration. It's maintained by independent developers (including @Prince_Canuma) and provides a Python interface similar to Hugging Face's transformers library but optimized for Apple hardware.

How does the 1.9× decode speed improvement affect real applications?

The decode speed improvement directly reduces latency when the model generates text responses. For interactive applications like chatbots with visual context, coding assistants that analyze screenshots, or document understanding tools, this means faster responses and smoother user experiences. The improvement is most noticeable in longer conversations where multiple decoding steps occur.
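
As a rough illustration with made-up throughput numbers (actual tokens/second depends on the model and chip):

```python
# How a 1.9x decode speedup translates to response latency
# for a 200-token reply. Baseline throughput is hypothetical.
baseline_tps = 30.0                 # tokens/second before the update
optimized_tps = baseline_tps * 1.9  # tokens/second after
tokens = 200

before = tokens / baseline_tps
after = tokens / optimized_tps
print(f"{before:.1f}s -> {after:.1f}s")  # 6.7s -> 3.5s
```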

Can I use mlx-vlm with existing Hugging Face models?

mlx-vlm supports specific model architectures that have been ported to the mlx framework. While not all Hugging Face models are compatible, the library supports popular architectures like LLaVA, Gemma, and now Falcon-Perception. You typically need to convert weights from standard formats to mlx-compatible formats, which the library's documentation explains.

What Apple hardware is required for these optimizations?

The TurboQuant Metal kernels require an Apple Silicon Mac (M1, M2, M3, or M4 series) running macOS Sonoma or later. The optimizations run on the GPU through Apple's Metal API, so performance improvements will vary based on your specific chip generation and memory configuration.


AI Analysis

This mlx-vlm update represents a strategic optimization play rather than a fundamental architectural breakthrough. The 1.9× decode speedup through TurboQuant kernels demonstrates that substantial performance gains remain available through framework-level optimizations, even as model architecture improvements show diminishing returns. This aligns with the broader industry trend we've observed throughout 2025-2026: after the initial rush to support Apple Silicon, projects are now entering a refinement phase focused on squeezing maximum performance from the hardware.

The inclusion of Falcon-Perception 300M is particularly interesting in context. TII has been aggressively positioning itself in the efficient model space, and this marks their expansion into the Apple ecosystem. Given Apple's walled-garden approach to hardware, successful frameworks like mlx-vlm become crucial gateways for model providers wanting to reach Apple's developer base. We saw similar dynamics with Core ML adoption in 2024-2025, where early framework support often determined which models gained traction in iOS/macOS applications.

The VisionFeatureCache innovation addresses what has been a persistent but under-discussed inefficiency in multimodal systems. Most research papers focus on single-turn evaluations, but real applications involve multi-turn conversations where users reference the same image repeatedly. This practical optimization suggests the mlx-vlm team is building based on real usage patterns rather than just benchmark performance, a healthy sign for the project's maturity. As we move into 2026, expect more frameworks to implement similar caching mechanisms once the pattern proves effective in production deployments.
