Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming
What Happened
According to a report shared by Simon Willison, a developer named Dan has successfully run the Qwen 3.5 397B-A17B model—a 209GB Mixture of Experts (MoE) model—on an Apple M3 Mac. The system achieves approximately 5.7 tokens per second while using only 5.5GB of active memory.
The key technical approach involves quantizing the model weights and streaming them from the SSD during inference. This works particularly well with MoE architectures because each token's computation activates only a small subset of the model's total parameters. The SSD in question reportedly provides around 17GB/s of bandwidth, enough throughput to sustain this streaming approach.
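The core idea can be sketched with NumPy's memory mapping. This is a simplification under assumed shapes and routing; the source does not describe the actual implementation, quantization format, or Qwen's architecture. The point is only that expert weights can live on disk and be paged in when the router selects them:

```python
import os
import tempfile
import numpy as np

# Hypothetical sketch of SSD weight streaming for one MoE layer: expert
# weights live in a file on disk, and only the experts the router selects
# for the current token are read into RAM. All sizes are illustrative.
NUM_EXPERTS, D_IN, D_OUT, TOP_K = 8, 16, 16, 2

# Write dummy expert weights to a file standing in for the SSD.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.default_rng(0).standard_normal(
    (NUM_EXPERTS, D_IN, D_OUT), dtype=np.float32
).tofile(path)

# Memory-map the file: no expert is resident until it is actually indexed.
experts = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(NUM_EXPERTS, D_IN, D_OUT))

def moe_forward(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    """Run one token through only the top-k experts (paged in from disk)."""
    top = np.argsort(router_logits)[-TOP_K:]       # router picks k experts
    gate = np.exp(router_logits[top])
    gate /= gate.sum()                             # softmax over chosen experts
    # Indexing the memmap pages in only the selected experts' weights.
    return sum(g * (x @ np.asarray(experts[e])) for e, g in zip(top, gate))

x = np.ones(D_IN, dtype=np.float32)
y = moe_forward(x, np.arange(NUM_EXPERTS, dtype=np.float32))
```

In a real system the OS page cache plus an explicit expert cache would keep frequently used experts resident, so consecutive tokens rarely reload everything from storage.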
Context
The Qwen 3.5 397B-A17B is part of Alibaba's Qwen family of large language models. As a Mixture of Experts model, it contains 397 billion total parameters but routes each token through only a small subset of its experts; the "A17B" suffix indicates roughly 17 billion active parameters per token, making the working set during inference far smaller than the total model size.
Running models of this scale on consumer hardware typically requires some combination of:
- Aggressive quantization (e.g., 4-bit or lower)
- Offloading weights to system memory or storage
- Specialized optimization for MoE architectures
This implementation appears to combine all three approaches, leveraging the M3 Mac's fast SSD (reportedly ~17GB/s) to stream weights as needed rather than loading the entire model into RAM.
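To put the quantization requirement in perspective, a quick back-of-envelope calculation (assuming "GB" means 10^9 bytes) shows how a 397B-parameter footprint shrinks with bit width; the reported 209GB on-disk size lands just above the 4-bit figure:

```python
# Storage footprint of a 397B-parameter model at various bit widths.
TOTAL_PARAMS = 397e9

def model_size_gb(bits_per_param: float) -> float:
    """Bytes needed for all parameters, expressed in GB (10^9 bytes)."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits:2d}-bit: {model_size_gb(bits):6.1f} GB")
# 16-bit: 794.0 GB, 8-bit: 397.0 GB, 4-bit: 198.5 GB, 2-bit: 99.2 GB
```

Dividing the other way, 209GB over 397B parameters works out to roughly 4.2 bits per parameter, consistent with a ~4-bit quantization scheme plus its metadata (scales, zero points).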
Technical Implications
While specific implementation details aren't provided in the source, the reported numbers suggest several technical achievements:
- Memory Efficiency: 5.5GB of resident memory for a 397B-parameter model implies aggressive quantization (the 209GB on-disk size works out to roughly 4.2 bits per parameter) combined with loading only the experts a token actually activates.
- Storage Bandwidth Utilization: The ~17GB/s SSD bandwidth keeps up with the 5.7 tokens/second generation rate, given MoE's sparse activation patterns and, plausibly, reuse of recently loaded experts across consecutive tokens.
- MoE Optimization: The approach specifically exploits MoE sparsity: of the 397B total parameters, only about 17B (per the "A17B" suffix) are active for any given token.
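As a sanity check on the bandwidth claim, the arithmetic below (under assumed figures: ~17B active parameters per token and ~4.2 bits per parameter, neither stated in the source) shows that naively re-streaming every active parameter for every token would exceed 17GB/s, which suggests the 5.5GB working set is effectively serving as a cache for hot experts:

```python
# Back-of-envelope bandwidth check. Assumptions (not from the source):
# ~17B active parameters per token (the "A17B" suffix) and ~4.2 bits per
# parameter (209GB on disk divided by 397B total parameters).
ACTIVE_PARAMS = 17e9
BITS_PER_PARAM = 209e9 * 8 / 397e9        # ~4.2 bits
TOKENS_PER_SEC = 5.7
SSD_GBPS = 17.0

gb_per_token = ACTIVE_PARAMS * BITS_PER_PARAM / 8 / 1e9   # ~9 GB
naive_gbps = gb_per_token * TOKENS_PER_SEC                # ~51 GB/s

print(f"GB streamed per token (naive full reload): {gb_per_token:.1f}")
print(f"bandwidth needed if every token reloads:   {naive_gbps:.1f} GB/s")
# This naive figure exceeds the ~17 GB/s the SSD provides, so hitting
# 5.7 tokens/s implies many active experts are already resident from
# earlier tokens rather than re-read from storage each step.
```

The gap between the naive requirement and the observed throughput is a rough indicator of how much expert reuse MoE routing exhibits in practice.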
This demonstration shows that with proper optimization, even massive MoE models can run on consumer hardware without requiring hundreds of gigabytes of RAM, opening possibilities for local deployment of frontier-scale models.