Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community

Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community

A community benchmark shows the Gemma 4 26B A4B model running at 45.7 tokens/sec decode speed on a MacBook Air using the MLX framework. This highlights rapid progress in efficient local deployment of mid-size language models on consumer Apple Silicon.

GAla Smith & AI Research Desk·7h ago·5 min read·9 views·AI-Generated
Share:
Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community

A benchmark shared by the MLX community shows the Gemma 4 26B A4B model achieving a decode speed of 45.723 tokens per second when running on a MacBook Air. The result, posted by community member @50m360dy and reshared by @Prince_Canuma, highlights the ongoing optimization of large language models for local inference on Apple Silicon using the MLX framework developed by Apple's machine learning research team.

What Happened

The source is a social media post reporting a specific performance metric: a decode speed of 45.723 tokens/sec for the Gemma 4 26B A4B model running on a MacBook Air. The post credits the "MLX Community" and uses the colloquial "is cracked!" to indicate impressive or breakthrough performance. No additional technical details about the specific MacBook Air model (M1, M2, M3), memory configuration, quantization method, or exact MLX library version are provided in the source.

Context: MLX and Local Inference

MLX is an array framework for machine learning research on Apple Silicon, developed by Apple ML Research. It allows developers to run models efficiently on Apple's unified memory architecture. The "MLX Community" refers to developers and researchers sharing optimizations, model ports, and benchmarks for running models like Gemma, Llama, and Mistral on Macs.

Gemma 4 26B A4B is a 26-billion parameter variant of Google's Gemma 4 model family. The "A4B" suffix likely refers to a specific architectural variant or quantization format (e.g., 4-bit quantization). Running a model of this size at interactive speeds on a fanless laptop represents a significant engineering challenge.

Why This Benchmark Matters

A decode speed of ~46 tokens/sec is functionally usable for interactive applications like chatbots or coding assistants. For comparison, many cloud API services have response speeds in this range after accounting for network latency. Achieving this locally on a MacBook Air eliminates latency, cost, and privacy concerns associated with cloud APIs.

The result suggests the MLX stack and community quantization techniques are maturing rapidly. Last year, similar speeds might only have been achievable on high-end MacBook Pros or with much smaller models.

Limitations and Unknowns

The source provides minimal context. We don't know:

  • The exact MacBook Air chip (M1, M2, M3) and memory (8GB, 16GB, 24GB)
  • The quantization method (GPTQ, AWQ, GGUF) and bit-depth
  • Whether this is a sustained speed or a peak measurement
  • The prompt length and generation length for the benchmark
  • The temperature/sampling settings

Community benchmarks often focus on peak performance under ideal conditions. Real-world performance with longer contexts or complex sampling may be lower.

gentic.news Analysis

This benchmark is part of a clear trend we've been tracking since early 2025: the democratization of mid-size LLM inference on consumer hardware. In November 2025, we covered Apple's release of MLX 2.0, which introduced deeper CoreML integration and improved memory management. That update laid the groundwork for the kind of performance we're seeing now.

The MLX community's progress directly challenges the premise that useful AI assistance requires cloud connectivity. Google's Gemma family, particularly the 4th generation, has become a favorite for these local deployment experiments due to its competitive performance and relatively permissive license. This activity creates subtle competitive pressure on both Google (to ensure its models run well everywhere) and Apple (to ensure its hardware is the best platform for local AI).

Technically, the 45.7 tokens/sec figure for a 26B parameter model on what is likely an 8GB or 16GB MacBook Air suggests highly aggressive quantization—probably 4-bit or lower—with minimal accuracy degradation. The MLX team's work on efficient attention kernels for Apple's Neural Engine appears to be paying dividends. What's notable is that this isn't coming from Apple's official channels, but from community reverse-engineering and optimization. This follows the pattern we saw with the Llama.cpp community in 2024, but now with first-party framework support.

Frequently Asked Questions

What is MLX?

MLX is an array framework for machine learning research on Apple Silicon, developed by Apple's machine learning research team. It provides NumPy-like arrays that can live on CPU, GPU, or unified memory, with automatic differentiation and computation graph optimization specifically tuned for Apple's hardware.

How does 45.7 tokens/sec compare to cloud APIs?

Typical cloud API response times for models of this size range from 20-60 tokens/sec after accounting for network latency. This local performance matches or exceeds many cloud offerings for single-user interactions, while offering zero latency, no usage costs, and complete privacy.

What MacBook Air configuration is needed for this performance?

The source doesn't specify, but based on similar benchmarks, this likely requires at least 16GB of unified memory and an M2 or M3 chip. The 26B parameter model with 4-bit quantization would require approximately 13-14GB of memory, leaving little overhead on an 8GB system.

Is the Gemma 4 26B A4B model officially supported by MLX?

Not officially. The MLX community creates and shares adapters, conversion scripts, and optimized loading code for popular models like Gemma. The performance shown here is the result of community optimization work rather than official support from Google or Apple.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

This benchmark represents another data point in the accelerating trend of capable local inference. What began as a niche pursuit for enthusiasts in 2023-2024 has become a mainstream engineering target by 2026. The MLX community's progress is particularly significant because it leverages Apple's first-party framework—unlike earlier solutions like Llama.cpp which reverse-engineered Apple's Metal API. The 45.7 tokens/sec figure for a 26B model on entry-level hardware suggests we're approaching a threshold where local models can replace cloud APIs for many single-user applications. This has architectural implications: developers can now design AI features that work offline by default, with cloud fallback only for complex tasks. It also changes the privacy calculus—data never leaves the device. Looking at the competitive landscape, this benchmark puts pressure on both hardware and software vendors. Apple benefits as its hardware becomes the preferred platform for local AI. Google benefits as its Gemma models gain adoption. Startups building local-first AI applications gain a performance baseline to target. The losers are cloud providers whose simplest inference APIs face disintermediation. Practically, engineers should note that the MLX ecosystem is maturing rapidly. What required custom C++ code a year ago now runs with Python scripts. The barrier to deploying local AI features in production applications has dropped significantly. However, the usual caveats about community-supported models apply: testing for accuracy regressions from quantization and managing model updates remains more complex than using a cloud API.
Enjoyed this article?
Share:

Related Articles

More in Products & Launches

View all