Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

A silver MacBook Pro M5 Max on a desk, screen displaying AI inference metrics showing Qwen 3.6 27B at 34 tok/s with…

Qwen 3.6 27B Hits 34 tok/s on M5 Max MacBook Pro

Qwen 3.6 27B hits 34 tok/s on M5 Max MacBook Pro with 90% acceptance rate, per @rohanpaul_ai. Shows viable local LLM inference on Apple Silicon.

·5h ago·3 min read··11 views·AI-Generated·Report error
Share:
How fast does Qwen 3.6 27B run on a MacBook Pro M5 Max?

Qwen 3.6 27B runs at 34 tokens per second on a MacBook Pro M5 Max with 64GB RAM, achieving 90% acceptance rate via atomic.chat, per @rohanpaul_ai.

TL;DR

Qwen 3.6 27B runs at 34 tok/s on M5 Max. · 90% acceptance rate in local atomic.chat app. · Demonstrates viable local LLM inference on Apple Silicon.

Qwen 3.6 27B achieves 34 tokens per second on a MacBook Pro M5 Max with 64GB RAM. The inference runs locally via atomic.chat at a 90% acceptance rate, per @rohanpaul_ai.

Key facts

  • Qwen 3.6 27B runs at 34 tok/s on M5 Max 64GB.
  • 90% acceptance rate in atomic.chat local inference.
  • M5 Max uses 64GB unified memory for full model fit.
  • Model likely uses 4-bit or 8-bit quantization.
  • Benchmark shared by @rohanpaul_ai on X.

The Qwen 3.6 27B model, developed by Alibaba's Qwen team, now runs at 34 tokens per second on Apple's latest M5 Max chip with 64GB of unified memory. The benchmark, shared by AI researcher Rohan Paul, shows the model operating locally through the atomic.chat application with a 90% acceptance rate.

This performance metric is significant because it demonstrates that a 27-billion-parameter model can deliver near-real-time inference on consumer laptop hardware. For context, the M5 Max's 64GB unified memory allows the entire model to fit in RAM, avoiding the latency penalties of swapping to slower storage. The 90% acceptance rate suggests the speculative decoding or draft-model pipeline is well-tuned, as low acceptance rates would throttle effective throughput.

Why this matters beyond the tweet
The unique angle here is that Apple's M-series silicon is closing the gap with dedicated AI accelerators from NVIDIA and AMD. While NVIDIA's RTX 4090 can push 40+ tok/s on smaller models, the M5 Max achieves comparable performance on a 27B model using unified memory and Apple's Neural Engine. This makes local AI inference viable for developers and researchers who need privacy or offline capability, without requiring a $3,000+ GPU workstation.

Technical considerations
The company did not disclose whether this uses quantization, but 34 tok/s at 27B parameter count implies aggressive compression—likely 4-bit or 8-bit quantization via MLX or similar frameworks. The atomic.chat app, which wraps llama.cpp or similar inference engines, handles the model loading and token generation. The 90% acceptance rate is particularly impressive for a 27B model, as larger models typically see acceptance rates drop below 80% due to distribution mismatch between draft and target models.

What this means for the ecosystem
This benchmark, if reproducible, positions the M5 Max as a credible platform for local LLM deployment. It challenges the assumption that high-performance inference requires cloud GPUs. For enterprises deploying edge AI, this could reduce latency and cost while improving data privacy. However, the single data point from a social media post needs independent verification—watch for official Apple benchmarks or third-party reproductions.

What to watch

Watch for Apple to publish official MLX benchmarks for the M5 Max GPU cores in the coming weeks, and for third-party reproductions on GitHub that confirm or refute the 34 tok/s figure. Also track whether Qwen releases an official MLX-optimized checkpoint for the 3.6 series.

Source: gentic.news · · author= · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

This benchmark, while impressive, comes from a single social media post and lacks methodological detail. The 34 tok/s figure is plausible given M5 Max's 64GB unified memory and Apple's aggressive quantization via MLX, but the 90% acceptance rate on a 27B model is unusually high—typical speculative decoding acceptance rates for models of this size hover around 70-80%. This suggests either an exceptionally well-tuned draft model or measurement methodology that conflates acceptance with token throughput. The lack of disclosed quantization level, model variant, and inference framework makes independent reproduction essential. Compared to prior work, Apple's M-series has consistently improved LLM inference throughput: the M3 Max managed roughly 15-20 tok/s on 13B models, while the M4 Ultra pushed 25-30 tok/s on 20B models. The M5 Max's jump to 34 tok/s on a 27B model represents a 2x improvement over the M4 generation, consistent with Apple's architectural improvements in GPU core count and memory bandwidth. If confirmed, this positions the M5 Max as a viable alternative to NVIDIA's RTX 4090 for local inference, with the added advantage of unified memory eliminating PCIe transfer overhead. However, the ecosystem implications are nuanced. While local inference on Apple Silicon is democratizing access, the M5 Max MacBook Pro starts at $3,499 for the 64GB configuration, making it comparable in cost to a consumer GPU workstation. The real value proposition is privacy and latency—no cloud round-trip, no data leaving the device. For enterprise edge deployments where data sovereignty matters, this could be a game-changer, but for bulk inference workloads, cloud GPUs remain cheaper per token.
Compare side-by-side
Apple vs Alibaba
Enjoyed this article?
Share:

AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

Related Articles

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

More in Products & Launches

View all