Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming

Developer Dan reportedly runs the 209GB Qwen 3.5 397B-A17B MoE model on an M3 Mac at ~5.7 tokens per second using only 5.5GB of active memory by quantizing and streaming weights from SSD.

3h ago · 2 min read · via @simonw

What Happened

According to a report shared by Simon Willison, a developer named Dan has successfully run the Qwen 3.5 397B-A17B model—a 209GB Mixture of Experts (MoE) model—on an Apple M3 Mac. The system achieves approximately 5.7 tokens per second while using only 5.5GB of active memory.

The key technical approach involves quantizing the model weights and then streaming them from the SSD during inference. This works particularly well with MoE architectures because each token's computation only activates a small subset of the model's total parameters. The SSD in question reportedly provides bandwidth of around 17GB/s, enabling sufficient throughput for this streaming approach.
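The general read pattern can be sketched with a memory-mapped weight file. This is an illustration of the technique, not the actual implementation from the report; the expert count, toy dimensions, and the `experts.bin` filename are all assumptions for the demo.

```python
# Illustrative sketch of streaming MoE expert weights from disk via
# memory mapping -- NOT the implementation from the report. Sizes and
# the "experts.bin" filename are made up for demonstration.
import numpy as np

N_EXPERTS, D_FF, D_MODEL = 8, 256, 64  # toy sizes; the real model is far larger

# Write a dummy weight file so the sketch is self-contained.
np.random.default_rng(0).standard_normal(
    (N_EXPERTS, D_FF, D_MODEL)).astype(np.float16).tofile("experts.bin")

# np.memmap maps the file without reading it up front: the OS pages in
# only the expert slices actually touched, keeping resident memory small.
experts = np.memmap("experts.bin", dtype=np.float16,
                    mode="r", shape=(N_EXPERTS, D_FF, D_MODEL))

def moe_up_projection(x, router_logits, top_k=2):
    """Apply the up-projection of a token's top-k experts only,
    so just those experts' weights are read from disk."""
    top = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[top] - router_logits[top].max())
    gates /= gates.sum()  # softmax over the selected experts
    out = np.zeros(D_FF, dtype=np.float32)
    for gate, idx in zip(gates, top):
        w = np.asarray(experts[idx], dtype=np.float32)  # pages in one expert
        out += gate * (w @ x)
    return out
```

A real system would additionally dequantize the weights on read and keep hot experts in an in-memory cache; this sketch only shows the on-demand read pattern that keeps active memory decoupled from total model size.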

Context

The Qwen 3.5 397B-A17B is part of Alibaba's Qwen 3.5 series of large language models. As a Mixture of Experts model, it contains 397 billion total parameters but routes each token through only a small subset of experts, for roughly 17 billion active parameters per token (the "A17B" in the name), making the compute and memory footprint during inference far smaller than the total size suggests.

Running models of this scale on consumer hardware typically requires either:

  • Aggressive quantization (e.g., 4-bit or lower)
  • Offloading weights to system memory or storage
  • Specialized optimization for MoE architectures

This implementation appears to combine all three approaches, leveraging the M3 Mac's fast SSD (reportedly ~17GB/s) to stream weights as needed rather than loading the entire model into RAM.
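The scale of the savings is easy to check with back-of-envelope arithmetic, assuming an fp16 baseline and taking the 17B active-parameter figure from the model name (the bit widths are assumptions; only the 209GB file size comes from the report):

```python
# Rough weight-storage math for a 397B-total / 17B-active MoE model.
TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9  # the "A17B" in the model name

def size_gb(params, bits):
    """Weight storage in GB at a given bit width."""
    return params * bits / 8 / 1e9

print(f"fp16, all weights:  {size_gb(TOTAL_PARAMS, 16):.0f} GB")  # ~794 GB
print(f"4-bit, all weights: {size_gb(TOTAL_PARAMS, 4):.0f} GB")   # ~198 GB
print(f"4-bit, active only: {size_gb(ACTIVE_PARAMS, 4):.1f} GB")  # 8.5 GB
print(f"2-bit, active only: {size_gb(ACTIVE_PARAMS, 2):.2f} GB")  # 4.25 GB
```

The 209GB file is consistent with roughly 4-bit storage on disk, and the reported 5.5GB active memory sits plausibly between the 2-bit and 4-bit active-weight figures once KV cache and activations are included.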

Technical Implications

While specific implementation details aren't provided in the source, the reported numbers suggest several technical achievements:

  1. Memory Efficiency: 5.5GB active memory for a 397B parameter model implies extreme quantization—likely 2-bit or mixed precision—combined with dynamic loading.
  2. Storage Bandwidth Utilization: The ~17GB/s SSD bandwidth, together with MoE's sparse activation and caching of shared and frequently used weights, keeps pace with the 5.7 tokens/second generation rate.
  3. MoE Optimization: The approach specifically exploits MoE sparsity: per the "A17B" designation, only about 17B of the 397B parameters are active for any given token, so only a small fraction of the weights must be resident at once.
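A quick sanity check on the bandwidth figures (assuming ~17B active parameters per the model name and ~2-bit weights, both assumptions) suggests that naively streaming every active weight per token would slightly outrun the SSD, so some weight reuse must be making up the difference:

```python
# Bandwidth back-of-envelope; parameter and bit-width figures are assumptions.
ACTIVE_PARAMS = 17e9   # per-token active parameters, from "A17B"
BITS = 2               # assumed quantization level
TOKENS_PER_SEC = 5.7
SSD_GBPS = 17.0

bytes_per_token = ACTIVE_PARAMS * BITS / 8            # 4.25e9 bytes
needed_gbps = bytes_per_token * TOKENS_PER_SEC / 1e9  # ~24.2 GB/s

# Naive streaming of every active weight would exceed the SSD's
# throughput, so a working system must serve part of each token's
# weights from memory: shared (non-expert) layers stay resident, and
# hot experts repeat across adjacent tokens.
cache_fraction = 1 - SSD_GBPS / needed_gbps
print(f"naive streaming needs {needed_gbps:.1f} GB/s vs {SSD_GBPS:.0f} GB/s SSD")
print(f"cache must cover at least {cache_fraction:.0%} of weight reads")
```

Under these assumptions, roughly 30% of weight reads per token would need to come from cached, already-resident weights rather than the SSD.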

This demonstration shows that with proper optimization, even massive MoE models can run on consumer hardware without requiring hundreds of gigabytes of RAM, opening possibilities for local deployment of frontier-scale models.

AI Analysis

This report, if verified with benchmarks, represents a significant engineering achievement in efficient inference for massive MoE models. The 5.5GB active memory usage for a 397B parameter model suggests aggressive 2-bit quantization or similar techniques, combined with just-in-time loading of expert weights from SSD. For comparison, loading a full-precision 397B model would require ~800GB of memory, while even 4-bit quantization would need ~200GB.

The 5.7 tokens/second throughput on consumer hardware is particularly notable. While slower than high-end GPU inference, it's usable for many applications and demonstrates that SSD bandwidth (~17GB/s on M3 Macs) can be sufficient for MoE models where only small subsets of weights are needed per token. This approach effectively trades latency for accessibility, making frontier-scale models runnable on hardware that costs under $2,000 rather than requiring $100,000+ GPU clusters.

Practitioners should note this validates several emerging trends:

  1. MoE architectures are particularly amenable to memory-constrained deployment.
  2. Storage bandwidth is becoming a critical factor for large model inference.
  3. Extreme quantization (2-bit and below) combined with smart caching can enable surprisingly good performance.

The next logical step would be to benchmark quality retention at these quantization levels and compare against smaller dense models running at higher precision.
Original source: x.com
