Technique · architecture
Mixture of Experts (Sparse MoE for LLMs)
An architecture in which a router activates only a small subset of expert sub-networks for each token, so parameter count can grow without a proportional increase in per-token compute cost.
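To make the routing idea concrete, below is a minimal illustrative sketch (in PyTorch) of a top-k routed MoE feed-forward block. The dimensions, expert definition, and `top_k` value are assumptions chosen for demonstration, not the configuration of any model in the timeline that follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-k routed mixture-of-experts feed-forward block."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network: one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                           # (num_tokens, n_experts)
        top_w, top_idx = torch.topk(logits, self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the top_k routed experts run for each token; the rest are skipped,
        # so compute scales with active parameters rather than total parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)          # 4 tokens of width 512
print(SparseMoE()(tokens).shape)      # torch.Size([4, 512])
```

Production MoE layers typically add load-balancing losses and capacity limits on top of this basic top-k routing; those details are omitted here for brevity.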
Deployment timeline
- Llama 4 Scout (high confidence)
Deployed 2025-04-05 · Velocity 8y
“Meta's first natively multimodal open-weight MoE model with 17B active / 109B total params, 16 experts”
- Llama 4 Maverick (high confidence)
Deployed 2025-04-05 · Velocity 8y
“Meta's flagship open-weight multimodal MoE with 17B active / 400B total params, 128 experts”
- GPT-4o (medium confidence)
Deployed 2026-02-16 · Velocity 9y
“GPT-4 is widely reported as a MoE model. GPT-4o is its successor, implying architectural continuity.”
- GPT-5 (high confidence)
Deployed 2026-02-16 · Velocity 9y
“GPT-5 is widely reported to be a Mixture of Experts (MoE) model, scaling parameters efficiently.”
- Gemini 3 Pro (high confidence)
Deployed 2026-02-19 · Velocity 9y
“Gemini 1.5 Pro is a Mixture-of-Experts (MoE) model with 8 experts and 2 active per token.”
- Gemini 3.1 (high confidence)
Deployed 2026-02-20 · Velocity 9y
“Gemini 3.1 is distinguished by its 'Mixture of Experts' (MoE) architecture.”
- GPT-5.3 (medium confidence)
Deployed 2026-02-26 · Velocity 9y
“OpenAI has explored MoE architectures (e.g., GPT-4); GPT-5.3 likely uses sparse MoE for efficient scaling.”
- Gemini 3 Flash (medium confidence)
Deployed 2026-02-27 · Velocity 9y
“Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. While Flash is a dense model, the overall Gemini family deploys MoE.”
- Kimi K2.5 (high confidence)
Deployed 2026-03-04 · Velocity 9y
“The 1 trillion parameter count strongly suggests a Mixture of Experts architecture to manage computational costs.”
- Gemini 3.1 Flash-Lite (high confidence)
Deployed 2026-03-05 · Velocity 9y
“Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. Flash-Lite is a distilled version of the larger MoE models.”
- DeepSeek-V3 (high confidence)
Deployed 2026-03-11 · Velocity 9y
“DeepSeek-V3 is a highly efficient mixture-of-experts language model.”
- Nemotron 3 Super (high confidence)
Deployed 2026-03-11 · Velocity 9y
“Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.”
- Mistral Small 4 (high confidence)
Deployed 2026-03-16 · Velocity 9y
“Mistral Small 4 is a 119B-parameter Mixture of Experts model.”
- DeepSeek-R1 (high confidence)
Deployed 2026-03-17 · Velocity 9y
“671B parameter model uses sparse mixture-of-experts architecture.”
- Qwen 3.6 (high confidence)
Deployed 2026-03-31 · Velocity 9y
“Qwen 3.6 includes a MoE version (Qwen 3.6 MoE) with 14B active parameters.”
- GPT-5.4-Cyber (high confidence)
Deployed 2026-04-16 · Velocity 9y
“GPT-4 is reported to use a Mixture of Experts architecture.”
- Kimi K2.6 (high confidence)
Deployed 2026-04-20 · Velocity 9y
“Moonshot AI's 1T-param MoE (32B active) architecture explicitly uses mixture-of-experts.”
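As a rough aid for reading the active-vs-total figures quoted in the timeline above, the snippet below computes the fraction of weights touched per token for three of the entries; per-token compute tracks the active count, not the total. The parameter figures are taken from the timeline quotes themselves, but the ratio is a back-of-the-envelope simplification rather than vendor-published math.

```python
# Back-of-the-envelope: fraction of parameters active per token, using the
# active/total figures quoted in the timeline entries above.
models = {
    "Llama 4 Scout":    {"active_b": 17, "total_b": 109},
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400},
    "Kimi K2.6":        {"active_b": 32, "total_b": 1000},
}
for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B weights per token "
          f"({frac:.1%} of total)")
```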