Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming

Developer Dan reportedly runs the 209GB Qwen 3.5 397B-A17B MoE model on an M3 Mac at ~5.7 tokens per second using only 5.5GB of active memory by quantizing and streaming weights from SSD.

3h ago · 2 min read · via @simonw

What Happened

According to a report shared by Simon Willison, a developer named Dan has successfully run the Qwen 3.5 397B-A17B model—a 209GB Mixture of Experts (MoE) model—on an Apple M3 Mac. The system achieves approximately 5.7 tokens per second while using only 5.5GB of active memory.

The key technical approach involves quantizing the model weights and then streaming them from the SSD during inference. This works particularly well with MoE architectures because each token's computation only activates a small subset of the model's total parameters. The SSD in question reportedly provides bandwidth of around 17GB/s, enabling sufficient throughput for this streaming approach.
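The general read pattern can be sketched with a memory-mapped weight file. This is an illustration of the technique, not the actual implementation from the report; the expert count, toy dimensions, and the `experts.bin` filename are all assumptions for the demo.

```python
# Illustrative sketch of streaming MoE expert weights from disk via
# memory mapping -- NOT the implementation from the report. Sizes and
# the "experts.bin" filename are made up for demonstration.
import numpy as np

N_EXPERTS, D_FF, D_MODEL = 8, 256, 64  # toy sizes; the real model is far larger

# Write a dummy weight file so the sketch is self-contained.
np.random.default_rng(0).standard_normal(
    (N_EXPERTS, D_FF, D_MODEL)).astype(np.float16).tofile("experts.bin")

# np.memmap maps the file without reading it up front: the OS pages in
# only the expert slices actually touched, keeping resident memory small.
experts = np.memmap("experts.bin", dtype=np.float16,
                    mode="r", shape=(N_EXPERTS, D_FF, D_MODEL))

def moe_up_projection(x, router_logits, top_k=2):
    """Apply the up-projection of a token's top-k experts only,
    so just those experts' weights are read from disk."""
    top = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[top] - router_logits[top].max())
    gates /= gates.sum()  # softmax over the selected experts
    out = np.zeros(D_FF, dtype=np.float32)
    for gate, idx in zip(gates, top):
        w = np.asarray(experts[idx], dtype=np.float32)  # pages in one expert
        out += gate * (w @ x)
    return out
```

A real system would additionally dequantize the weights on read and keep hot experts in an in-memory cache; this sketch only shows the on-demand read pattern that keeps active memory decoupled from total model size.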

Context

The Qwen 3.5 397B-A17B is part of Alibaba's Qwen 3.5 series of large language models. As a Mixture of Experts model, it contains 397 billion total parameters but routes each token through only a small subset of experts, for roughly 17 billion active parameters per token (the "A17B" in the name), making the compute and memory footprint during inference far smaller than the total size suggests.

Running models of this scale on consumer hardware typically requires either:

  • Aggressive quantization (e.g., 4-bit or lower)
  • Offloading weights to system memory or storage
  • Specialized optimization for MoE architectures

This implementation appears to combine all three approaches, leveraging the M3 Mac's fast SSD (reportedly ~17GB/s) to stream weights as needed rather than loading the entire model into RAM.
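The scale of the savings is easy to check with back-of-envelope arithmetic, assuming an fp16 baseline and taking the 17B active-parameter figure from the model name (the bit widths are assumptions; only the 209GB file size comes from the report):

```python
# Rough weight-storage math for a 397B-total / 17B-active MoE model.
TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9  # the "A17B" in the model name

def size_gb(params, bits):
    """Weight storage in GB at a given bit width."""
    return params * bits / 8 / 1e9

print(f"fp16, all weights:  {size_gb(TOTAL_PARAMS, 16):.0f} GB")  # ~794 GB
print(f"4-bit, all weights: {size_gb(TOTAL_PARAMS, 4):.0f} GB")   # ~198 GB
print(f"4-bit, active only: {size_gb(ACTIVE_PARAMS, 4):.1f} GB")  # 8.5 GB
print(f"2-bit, active only: {size_gb(ACTIVE_PARAMS, 2):.2f} GB")  # 4.25 GB
```

The 209GB file is consistent with roughly 4-bit storage on disk, and the reported 5.5GB active memory sits plausibly between the 2-bit and 4-bit active-weight figures once KV cache and activations are included.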

Technical Implications

While specific implementation details aren't provided in the source, the reported numbers suggest several technical achievements:

  1. Memory Efficiency: 5.5GB active memory for a 397B parameter model implies extreme quantization—likely 2-bit or mixed precision—combined with dynamic loading.
  2. Storage Bandwidth Utilization: The ~17GB/s SSD bandwidth, together with MoE's sparse activation and caching of shared and frequently used weights, keeps pace with the 5.7 tokens/second generation rate.
  3. MoE Optimization: The approach specifically exploits MoE sparsity: per the "A17B" designation, only about 17B of the 397B parameters are active for any given token, so only a small fraction of the weights must be resident at once.
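A quick sanity check on the bandwidth figures (assuming ~17B active parameters per the model name and ~2-bit weights, both assumptions) suggests that naively streaming every active weight per token would slightly outrun the SSD, so some weight reuse must be making up the difference:

```python
# Bandwidth back-of-envelope; parameter and bit-width figures are assumptions.
ACTIVE_PARAMS = 17e9   # per-token active parameters, from "A17B"
BITS = 2               # assumed quantization level
TOKENS_PER_SEC = 5.7
SSD_GBPS = 17.0

bytes_per_token = ACTIVE_PARAMS * BITS / 8            # 4.25e9 bytes
needed_gbps = bytes_per_token * TOKENS_PER_SEC / 1e9  # ~24.2 GB/s

# Naive streaming of every active weight would exceed the SSD's
# throughput, so a working system must serve part of each token's
# weights from memory: shared (non-expert) layers stay resident, and
# hot experts repeat across adjacent tokens.
cache_fraction = 1 - SSD_GBPS / needed_gbps
print(f"naive streaming needs {needed_gbps:.1f} GB/s vs {SSD_GBPS:.0f} GB/s SSD")
print(f"cache must cover at least {cache_fraction:.0%} of weight reads")
```

Under these assumptions, roughly 30% of weight reads per token would need to come from cached, already-resident weights rather than the SSD.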

This demonstration shows that with proper optimization, even massive MoE models can run on consumer hardware without requiring hundreds of gigabytes of RAM, opening possibilities for local deployment of frontier-scale models.

AI Analysis

This report, if verified with benchmarks, represents a significant engineering achievement in efficient inference for massive MoE models. The 5.5GB active memory usage for a 397B parameter model suggests aggressive 2-bit quantization or similar techniques, combined with just-in-time loading of expert weights from SSD. For comparison, loading a full-precision 397B model would require ~800GB of memory, while even 4-bit quantization would need ~200GB.

The 5.7 tokens/second throughput on consumer hardware is particularly notable. While slower than high-end GPU inference, it's usable for many applications and demonstrates that SSD bandwidth (~17GB/s on M3 Macs) can be sufficient for MoE models where only small subsets of weights are needed per token. This approach effectively trades latency for accessibility, making frontier-scale models runnable on hardware that costs under $2,000 rather than requiring $100,000+ GPU clusters.

Practitioners should note this validates several emerging trends:

  1. MoE architectures are particularly amenable to memory-constrained deployment.
  2. Storage bandwidth is becoming a critical factor for large model inference.
  3. Extreme quantization (2-bit and below) combined with smart caching can enable surprisingly good performance.

The next logical step would be to benchmark quality retention at these quantization levels and compare against smaller dense models running at higher precision.
Original source: x.com
