mlx-vlm v0.6.2 Adds Gemma 4 QAT Support for Local GPUs

mlx-vlm v0.6.2 adds launch-day support for Google DeepMind's Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and edge devices. It also adds video input for the 12B model.

Ala SMITH & AI Research Desk · Jun 5, 2026 · 3 min read · AI-Generated

Source: x.com via @Prince_Canuma

TL;DR

mlx-vlm v0.6.2 released with Gemma 4 QAT support · Gemma 4 QAT checkpoints target consumer GPUs · Adds video input for Gemma 4 12B

mlx-vlm v0.6.2 shipped with launch-day support for Google DeepMind's Gemma 4 quantization-aware training (QAT) checkpoints. The release makes it possible to run compressed Gemma 4 models locally on consumer GPUs and edge devices.

Key facts

mlx-vlm v0.6.2 released with Gemma 4 QAT support
Gemma 4 QAT checkpoints from Google DeepMind
Video input support for Gemma 4 12B model
Optimized for consumer GPUs and edge devices
Launch-day partnership with Google DeepMind

The mlx-vlm team announced via X that version 0.6.2 integrates Google DeepMind's newly released Gemma 4 QAT checkpoints. These checkpoints are quantization-aware trained, designed to compress the model while retaining accuracy for local inference on consumer hardware.

The update includes reliability fixes specific to Gemma 4, and adds video input support for the Gemma 4 12B variant. An APC fix for single requests is also included. The release positions mlx-vlm as a launch-day partner for Google DeepMind's QAT release, which aims to make Gemma 4 more accessible outside of data center GPUs.

Quantization-aware training simulates low-precision arithmetic during training, so models can be compressed with less of the accuracy loss typically seen in post-training quantization. The Gemma 4 QAT checkpoints are optimized for consumer GPUs and edge devices, meaning larger models can run on less powerful hardware. The mlx-vlm framework is built on Apple's MLX library, which is optimized for Apple Silicon, but the checkpoints themselves are not tied to any one framework.
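To make the idea concrete, here is a minimal sketch of the "fake quantization" round-trip that QAT schemes commonly apply to weights in the forward pass, so that the training loss sees the rounding error. It is an illustration only: the announcement does not describe Google DeepMind's actual recipe, and the function names, the symmetric per-tensor scale, and the sample weights are all assumptions made for this example. A real QAT setup would also pass gradients through the rounding step (for instance with a straight-through estimator), which is not shown.

```python
def fake_quantize(weights, bits=4):
    """Round-trip weights through a symmetric signed integer grid.

    Hypothetical helper for illustration; not the Gemma 4 QAT recipe.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    result = []
    for w in weights:
        q = round(w / scale)                    # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        result.append(q * scale)                # map back to floating point
    return result


def mean_abs_error(original, approximated):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, approximated)) / len(original)


# Made-up sample weights, chosen only to show the effect of bit-width.
weights = [0.82, -0.41, 0.07, -0.93, 0.30]
for bits in (8, 4):
    approx = fake_quantize(weights, bits)
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, approx):.5f}")
```

In post-training quantization this rounding error appears only after training has finished. In QAT the model is trained with the error already present, which is why it can learn weights that tolerate it better.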

What the release doesn't say

The announcement does not disclose benchmark performance for the QAT models versus the full-precision Gemma 4. No latency or memory figures are provided. The team also does not specify which quantization bit-widths are supported (e.g., 4-bit, 8-bit), though QAT typically targets 4-bit or 8-bit inference. The model collection link in the tweet is not expanded, so the exact model sizes available remain unconfirmed beyond the 12B variant.
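In the absence of published memory figures, a back-of-the-envelope calculation shows why bit-width matters for local deployment. The sketch below estimates weight storage only, and it rests on loudly stated assumptions: the 12B variant is taken to have exactly 12 billion parameters (inferred from its name, not confirmed by the announcement), every weight is stored at the same bit-width, and quantization scales, the KV cache, and activations are ignored. Real footprints would be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB.

    Ignores quantization scales, KV cache, activations, and runtime overhead.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: "12B" means 12e9 parameters; the actual count is not disclosed.
ASSUMED_PARAMS = 12e9

for bits in (16, 8, 4):
    gib = weight_memory_gib(ASSUMED_PARAMS, bits)
    print(f"{bits}-bit weights: about {gib:.1f} GiB")
```

Halving the bit-width halves the weight footprint, which is the basic reason 4-bit and 8-bit checkpoints are the usual targets for consumer hardware.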

Why this matters

Google DeepMind releasing QAT checkpoints on launch day is a shift. Previously, quantized versions of open models typically arrived after launch, produced post hoc by third parties using tools such as llama.cpp or AutoGPTQ. By baking quantization into training, Google aims to ensure the compressed models maintain fidelity, reducing the need for community calibration datasets. For mlx-vlm, being a launch-day partner signals that the framework is now a first-class deployment target for Google's open models, much as Hugging Face Transformers is on the PyTorch side.

What to watch

Watch for benchmark comparisons between Gemma 4 QAT and full-precision models on common edge benchmarks (e.g., the MLPerf Inference: Edge suite). Also watch whether Google releases QAT checkpoints on launch day for future open models, which would indicate a lasting shift in deployment strategy.

Source: gentic.news · Jun 5, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This release is a tactical move by Google DeepMind to control the quantization narrative. By releasing QAT checkpoints on launch day, Google ensures that compressed versions of Gemma 4 are official rather than community-created. This reduces the risk of accuracy degradation from third-party quantization, which has been a persistent issue for open models. For mlx-vlm, this partnership is a validation of the framework's relevance beyond Apple's ecosystem. However, the lack of any performance numbers makes it impossible to assess whether QAT actually delivers on its promise of near-lossless compression. The community will likely test this within days, and any quality degradation should show up quickly on public leaderboards such as the Open LLM Leaderboard.
