gentic.news — AI News Intelligence Platform

mlx-vlm v0.6.2 Adds Gemma 4 QAT Support for Local GPUs

mlx-vlm v0.6.2 adds launch-day support for Google DeepMind's Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and edge devices with video input for the 12B model.

TL;DR

mlx-vlm v0.6.2 released with Gemma 4 QAT support · Gemma 4 QAT checkpoints target consumer GPUs · Adds video input for Gemma 4 12B

mlx-vlm v0.6.2 shipped with launch-day support for Google DeepMind's Gemma 4 quantization-aware training (QAT) checkpoints. The release makes it possible to run compressed Gemma 4 models locally on consumer GPUs and edge devices.

Key facts

  • mlx-vlm v0.6.2 released with Gemma 4 QAT support
  • Gemma 4 QAT checkpoints from Google DeepMind
  • Video input support for Gemma 4 12B model
  • Optimized for consumer GPUs and edge devices
  • Launch-day partnership with Google DeepMind

The mlx-vlm team announced via X that version 0.6.2 integrates Google DeepMind's newly released Gemma 4 QAT checkpoints. These checkpoints were produced with quantization-aware training, which is designed to compress the model while retaining accuracy for local inference on consumer hardware.

The update includes reliability fixes specific to Gemma 4, and adds video input support for the Gemma 4 12B variant. An APC fix for single requests is also included. The release positions mlx-vlm as a launch-day partner for Google DeepMind's QAT release, which aims to make Gemma 4 more accessible outside of data center GPUs.

Quantization-aware training simulates low-precision weights during training, so models can be compressed with less of the accuracy loss typically seen in post-training quantization. The Gemma 4 QAT checkpoints are optimized for consumer GPUs and edge devices, meaning larger models can run on less powerful hardware. The mlx-vlm framework is built on Apple's MLX library, which is optimized for Apple Silicon, but the checkpoints themselves are not tied to MLX.
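To make the idea concrete, here is a minimal Python sketch of the quantize-then-dequantize step (often called "fake quantization") that QAT schemes commonly apply to weights in the forward pass. It is an illustration under assumptions, not Google DeepMind's recipe: it assumes symmetric rounding with a single per-tensor scale, a 4-bit default that the announcement does not confirm, and a hypothetical helper name `fake_quantize`.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    Assumptions for illustration only: one scale per tensor, symmetric range.
    """
    qmax = 2 ** (bits - 1) - 1              # largest integer level, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)                # nothing to scale
    scale = max_abs / qmax                  # float value of one integer step
    result = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        result.append(q * scale)            # dequantize back to a float
    return result


# Hypothetical weights, chosen only to show the rounding error.
weights = [0.70, -0.33, 0.12, 0.02, -0.66]
dequantized = fake_quantize(weights, bits=4)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print("4-bit round trip:", dequantized)
print("largest rounding error:", max_error)
```

Post-training quantization applies this rounding once to a finished model. QAT instead runs it inside the training loop, so the network can learn weights that tolerate the rounding; gradients are commonly passed through the rounding step unchanged (a straight-through estimator).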

What the release doesn't say

The announcement does not disclose benchmark performance for the QAT models versus the full-precision Gemma 4. No latency or memory figures are provided. The team also does not specify which quantization bit-widths are supported (e.g., 4-bit, 8-bit), though QAT typically targets 4-bit or 8-bit inference. The model collection link in the tweet is not expanded, so the exact model sizes available remain unconfirmed beyond the 12B variant.
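In the absence of published memory figures, a rough back-of-the-envelope estimate shows why bit-width matters for local deployment. The sketch below assumes the 12B variant has exactly 12 billion weights and counts weight storage only; activations, the KV cache, and per-group quantization scales are ignored. The 16-, 8-, and 4-bit widths are illustrative, since the release does not confirm which are supported, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_12b = 12e9  # assumption: "12B" taken as exactly 12 billion weights

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(params_12b, bits):.0f} GB")
```

Under these assumptions the weights alone take roughly 24 GB at 16-bit, 12 GB at 8-bit, and 6 GB at 4-bit, which is the difference between needing a data center card and fitting on a consumer GPU or laptop.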

Why this matters

Google DeepMind releasing QAT checkpoints on launch day is a shift. Quantization has typically been a post-hoc step performed by third parties (e.g., with llama.cpp or AutoGPTQ). By building quantization into training, Google aims to keep the compressed models faithful to the originals, reducing the need for community calibration datasets. For mlx-vlm, being a launch-day partner suggests that the framework is now a first-class deployment target for Google's open models, much as Hugging Face Transformers is in the PyTorch ecosystem.

What to watch

Watch for benchmark comparisons between Gemma 4 QAT and full-precision models on common edge benchmarks (e.g., the MLPerf Inference edge category). Also watch whether Google releases QAT checkpoints for future open models on launch day, which would indicate a lasting shift in deployment strategy.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This release is a tactical move by Google DeepMind to control the quantization narrative. By releasing QAT checkpoints on launch day, Google ensures that compressed versions of Gemma 4 are official rather than community-created. This reduces the risk of accuracy degradation from third-party quantization, which has been a persistent issue for open models. For mlx-vlm, this partnership is a validation of the framework's relevance beyond Apple's ecosystem. However, without any performance numbers it is impossible to assess whether QAT actually delivers on its promise of near-lossless compression. The community will likely test this within days, and any quality degradation should show up quickly on public leaderboards such as the Open LLM Leaderboard.