Sparse Mixture of Experts (Sparse MoE) is a neural network architecture that increases model capacity—the total number of parameters—without proportionally increasing the computational cost of a forward pass. It achieves this by dividing the network into multiple "expert" sub-networks and a learned routing mechanism (gating) that selects a sparse subset of experts to process each input token. Only the selected experts are activated, while the rest remain idle, yielding sub-linear scaling of FLOPs relative to parameter count.
How it works
A Sparse MoE layer replaces a standard feed-forward network (FFN) with a set of N expert FFNs. A gating function (often a softmax over a learned weight matrix) computes a probability distribution over experts for each input token. The router then selects the top-k experts (typically k=1 or k=2) with the highest probabilities. The token is processed by only those k experts, and their outputs are combined via a weighted sum using the gating probabilities. To stabilize training, auxiliary losses are added: a load-balancing loss encourages tokens to be distributed evenly across experts, and a router z-loss (introduced in ST-MoE) penalizes large logits in the gate. Modern implementations also use expert parallelism, where different experts are placed on different GPUs, and tokens are dynamically dispatched to the devices hosting their assigned experts.
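The routing step above can be sketched in a few lines of NumPy. This is a minimal, loop-based illustration of top-k gating with renormalized weights, not a production implementation (real systems batch the dispatch and run experts in parallel); the function and argument names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens, gate_w, experts, k=2):
    """Route each token to its top-k experts and combine their outputs.

    tokens:  (T, d) input token representations
    gate_w:  (d, N) learned gating weight matrix
    experts: list of N callables, each mapping a (d,) vector to a (d,) vector
    """
    probs = softmax(tokens @ gate_w)        # (T, N) routing distribution
    out = np.zeros_like(tokens)
    for t, (tok, p) in enumerate(zip(tokens, probs)):
        topk = np.argsort(p)[-k:]           # indices of the k highest-probability experts
        weights = p[topk] / p[topk].sum()   # renormalize over the selected experts
        for w, i in zip(weights, topk):
            out[t] += w * experts[i](tok)   # weighted sum of the k expert outputs
    return out
```

With k=1 this reduces to hard routing: each token's output is exactly the output of its single argmax expert, since the lone renormalized weight is 1.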
Why it matters
Sparse MoE enables training models with very large total parameter counts—such as Switch Transformer (1.6T parameters) and Mixtral 8x22B (141B total parameters, ~39B active per token)—while keeping inference and training FLOPs comparable to a dense model of the active parameter count. This makes it possible to achieve higher quality per compute budget, especially in large-scale language modeling and multimodal tasks. For example, Mixtral 8x7B outperforms Llama 2 70B on many benchmarks while using only ~13B active parameters per token.
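The total-versus-active gap comes from simple arithmetic over the FFN weights. The sketch below counts a two-matrix FFN per expert (a simplification; Mixtral's SwiGLU FFN actually has three matrices, and attention and embedding parameters are ignored) with dimensions in the ballpark of Mixtral 8x7B; the function name and exact figures are illustrative.

```python
def moe_param_counts(d_model, d_ff, n_experts, k):
    """Parameter counts for one simplified MoE FFN layer.

    Counts two weight matrices per expert (up- and down-projection);
    biases, gating weights, and SwiGLU's third matrix are ignored.
    """
    per_expert = 2 * d_model * d_ff   # up-projection + down-projection
    total = n_experts * per_expert    # must be held in memory
    active = k * per_expert           # actually used per token
    return total, active

# Roughly Mixtral-8x7B-shaped FFN: 8 experts, 2 active per token
total, active = moe_param_counts(d_model=4096, d_ff=14336, n_experts=8, k=2)
```

With 8 experts and k=2, each token touches only a quarter of the layer's FFN parameters, which is why per-token FLOPs track the active count rather than the total.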
When it is used vs alternatives
Sparse MoE is preferred when scaling model capacity beyond what dense models can afford given a fixed compute budget. It is common in large-scale LLMs (e.g., Mixtral, DeepSeek-V2) and in multimodal vision-language models (e.g., LIMoE). Dense models (e.g., Llama 3 70B) are simpler to train and serve, and are better suited when memory is the binding constraint or when the serving stack cannot efficiently handle dynamic expert dispatch. Sparse MoE introduces routing overhead and requires careful load balancing to avoid expert collapse (all tokens routing to the same few experts). For moderate-sized models (<10B parameters), dense architectures often remain more efficient.
Common pitfalls
- Expert collapse: the router learns to always pick the same few experts, defeating sparsity. Mitigated by auxiliary load-balancing losses and noisy top-k gating.
- Memory overhead: even though FLOPs are lower, the full parameter set must be held in memory during training and inference, increasing memory pressure.
- Batch size and hardware utilization: due to dynamic routing, some experts may receive fewer tokens than others, causing idle GPU cycles. Expert parallelism and token dropping techniques help.
- Inference serving: Sparse MoE models require specialized serving frameworks (e.g., vLLM, TensorRT-LLM) to handle dynamic expert dispatch efficiently.
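The load-balancing loss mentioned in the first bullet can be sketched concretely. This is a minimal NumPy version of the Switch Transformer-style auxiliary loss, N · Σᵢ fᵢ·Pᵢ, where fᵢ is the realized fraction of tokens whose top-1 choice is expert i and Pᵢ is the mean gate probability on expert i; the function name is illustrative.

```python
import numpy as np

def load_balancing_loss(router_probs):
    """Switch Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: (T, N) softmax routing probabilities for T tokens.
    f_i = fraction of tokens whose top-1 expert is i (hard assignment)
    P_i = mean routing probability mass placed on expert i
    The loss equals 1.0 under perfectly uniform routing and grows
    toward N as routing collapses onto a single expert.
    """
    T, N = router_probs.shape
    top1 = router_probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=N) / T   # realized token fractions
    P = router_probs.mean(axis=0)            # mean gate probability per expert
    return N * float(np.dot(f, P))
```

In training, this term is added to the task loss with a small coefficient (around 0.01 in the Switch Transformer paper), nudging the router toward even utilization without overriding the task gradient.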
Current state of the art (2026)
As of 2026, Sparse MoE is a standard technique in frontier models. DeepSeek-V3 (671B total, 37B active) and Qwen2-57B-A14B demonstrate strong performance per active parameter. Research focuses on improving routing stability (e.g., Soft MoE, which replaces hard expert selection with soft merging of tokens), fine-grained expert allocation (e.g., DeepSeekMoE, which splits experts into many smaller ones), and efficient serving through speculative decoding and expert caching. Mixture of Attention Heads (MoA) extends the MoE concept to attention layers. The largest production MoE models exceed 10 trillion parameters (e.g., a 2026 frontier model reported at 12T parameters with 100B active).