mlx-vlm v0.6.2 shipped with launch-day support for Google DeepMind's Gemma 4 quantization-aware training (QAT) checkpoints. The release makes it possible to run compressed Gemma 4 models locally on consumer GPUs and edge devices.
Key facts
- mlx-vlm v0.6.2 released with Gemma 4 QAT support
- Gemma 4 QAT checkpoints from Google DeepMind
- Video input support for Gemma 4 12B model
- Optimized for consumer GPUs and edge devices
- Launch-day partnership with Google DeepMind
The mlx-vlm team announced on X that version 0.6.2 integrates Google DeepMind's newly released Gemma 4 QAT checkpoints. These checkpoints were produced with quantization-aware training, which is designed to compress the model while retaining accuracy for local inference on consumer hardware.
The update includes reliability fixes specific to Gemma 4, adds video input support for the Gemma 4 12B variant, and fixes APC for single requests. The release positions mlx-vlm as a launch-day partner for Google DeepMind's QAT release, which aims to make Gemma 4 more accessible on hardware other than data center GPUs.
Quantization-aware training simulates low-precision weights during training, so a model can be compressed with less of the accuracy loss typically seen in post-training quantization. The Gemma 4 QAT checkpoints are optimized for consumer GPUs and edge devices, meaning larger models can run on less powerful hardware. The mlx-vlm framework is built on Apple's MLX library, which is optimized for Apple Silicon, but the checkpoints themselves are not specific to MLX.
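To make the idea concrete, here is a minimal sketch of the "fake quantization" step that QAT schemes commonly apply in the forward pass: weights are rounded to a low-bit grid and immediately converted back to floats, so the training loss sees the rounding error. This is an illustration only. The function names, the symmetric per-tensor scheme, and the sample weights are assumptions, not details of how Google DeepMind trained the Gemma 4 checkpoints.

```python
def fake_quantize(weights, bits=4):
    """Quantize then dequantize a list of floats (symmetric, per-tensor).

    Hypothetical helper for illustration; real QAT pipelines usually work
    per-channel or per-group and pass gradients through the rounding step
    with a straight-through estimator.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                # nothing to scale
    scale = max_abs / qmax                  # float value of one integer step
    out = []
    for w in weights:
        q = round(w / scale)                # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # back to float: the "fake" part
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up sample weights, chosen only to show the effect of bit-width.
weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.0, -0.18, 0.61]
w4 = fake_quantize(weights, bits=4)
w8 = fake_quantize(weights, bits=8)
print("4-bit mean abs error:", mean_abs_error(weights, w4))
print("8-bit mean abs error:", mean_abs_error(weights, w8))
```

Post-training quantization applies the same rounding once, after training has finished. QAT applies it throughout training, which gives the optimizer a chance to move weights toward values that survive rounding.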
What the release doesn't say

The announcement discloses no benchmark results comparing the QAT models with full-precision Gemma 4, and it provides no latency or memory figures. The team also does not specify which quantization bit-widths are supported, though QAT typically targets 4-bit or 8-bit inference. The model collection link in the post is not expanded, so the exact model sizes available remain unconfirmed beyond the 12B variant.
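In the absence of official memory figures, a back-of-the-envelope estimate shows why bit-width matters for local inference. The sketch below computes the storage needed for the weights alone. It assumes a nominal count of exactly 12 billion parameters for the 12B variant and ignores quantization scales, activations, the KV cache, and any vision components, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Storage for the weights alone, in GiB (no scales, cache, or activations)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2 ** 30


# Assumption: "12B" is treated as exactly 12e9 parameters.
params_12b = 12e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(params_12b, bits):.1f} GiB")
```

Under these assumptions the weights take about 22.4 GiB at 16-bit, 11.2 GiB at 8-bit, and 5.6 GiB at 4-bit, which is the difference between needing a workstation-class accelerator and fitting in the memory of a typical consumer machine.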
Why this matters
Google DeepMind releasing QAT checkpoints on launch day is a shift. Quantization has typically been a post-hoc step performed by third parties using tools such as llama.cpp or AutoGPTQ. By building quantization into training, Google aims to keep the compressed models faithful to the originals and to reduce the need for community calibration datasets. For mlx-vlm, being a launch-day partner signals that the framework is now a first-class deployment target for Google's open models, much as Hugging Face Transformers is for PyTorch users.
What to watch
Watch for benchmark comparisons between Gemma 4 QAT and full-precision models on common edge benchmarks (e.g., MLPerf Inference: Edge). Also watch whether Google releases QAT checkpoints for future open models on launch day, which would suggest a lasting shift in deployment strategy.








