Technique · architecture
Mixture of Experts (Sparse MoE for LLMs)
An architecture in which a router activates only a small subset of expert sub-networks for each token, so parameter count can grow without a proportional increase in per-token compute cost.
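To make the routing idea concrete, below is a minimal illustrative sketch (in PyTorch) of a top-k routed MoE feed-forward block. The dimensions, expert definition, and `top_k` value are assumptions chosen for demonstration, not the configuration of any model in the timeline that follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-k routed mixture-of-experts feed-forward block."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network: one logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                           # (num_tokens, n_experts)
        top_w, top_idx = torch.topk(logits, self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the top_k routed experts run for each token; the rest are skipped,
        # so compute scales with active parameters rather than total parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)          # 4 tokens of width 512
print(SparseMoE()(tokens).shape)      # torch.Size([4, 512])
```

Production MoE layers typically add load-balancing losses and capacity limits on top of this basic top-k routing; those details are omitted here for brevity.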
Deployment timeline
- Llama 4 Scout (high confidence)
Deployed 2025-04-05 · Velocity 8y
“Meta's first natively multimodal open-weight MoE model with 17B active / 109B total params, 16 experts”
- Llama 4 Maverick (high confidence)
Deployed 2025-04-05 · Velocity 8y
“Meta's flagship open-weight multimodal MoE with 17B active / 400B total params, 128 experts”
- GPT-4o (medium confidence)
Deployed 2026-02-16 · Velocity 9y
“GPT-4 is widely reported as a MoE model. GPT-4o is its successor, implying architectural continuity.”
- GPT-5 (high confidence)
Deployed 2026-02-16 · Velocity 9y
“GPT-5 is widely reported to be a Mixture of Experts (MoE) model, scaling parameters efficiently.”
- Gemini 3 Pro (high confidence)
Deployed 2026-02-19 · Velocity 9y
“Gemini 1.5 Pro is a Mixture-of-Experts (MoE) model with 8 experts and 2 active per token.”
- Gemini 3.1 (high confidence)
Deployed 2026-02-20 · Velocity 9y
“Gemini 3.1 is distinguished by its 'Mixture of Experts' (MoE) architecture.”
- GPT-5.3 (medium confidence)
Deployed 2026-02-26 · Velocity 9y
“OpenAI has explored MoE architectures (e.g., GPT-4); GPT-5.3 likely uses sparse MoE for efficient scaling.”
- Gemini 3 Flash (medium confidence)
Deployed 2026-02-27 · Velocity 9y
“Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. While Flash is a dense model, the overall Gemini family deploys MoE.”
- Kimi K2.5 (high confidence)
Deployed 2026-03-04 · Velocity 9y
“The 1 trillion parameter count strongly suggests a Mixture of Experts architecture to manage computational costs.”
- Gemini 3.1 Flash-Lite (high confidence)
Deployed 2026-03-05 · Velocity 9y
“Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. Flash-Lite is a distilled version of the larger MoE models.”
- DeepSeek-V3 (high confidence)
Deployed 2026-03-11 · Velocity 9y
“DeepSeek-V3 is a highly efficient mixture-of-experts language model.”
- Nemotron 3 Super (high confidence)
Deployed 2026-03-11 · Velocity 9y
“Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.”
- Mistral Small 4 (high confidence)
Deployed 2026-03-16 · Velocity 9y
“Mistral Small 4 is a 119B-parameter Mixture of Experts model.”
- DeepSeek-R1 (high confidence)
Deployed 2026-03-17 · Velocity 9y
“671B parameter model uses sparse mixture-of-experts architecture.”
- Qwen 3.6 (high confidence)
Deployed 2026-03-31 · Velocity 9y
“Qwen 3.6 includes a MoE version (Qwen 3.6 MoE) with 14B active parameters.”
- GPT-5.4-Cyber (high confidence)
Deployed 2026-04-16 · Velocity 9y
“GPT-4 is reported to use a Mixture of Experts architecture.”
- Kimi K2.6 (high confidence)
Deployed 2026-04-20 · Velocity 9y
“Moonshot AI's 1T-param MoE (32B active) architecture explicitly uses mixture-of-experts.”
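As a rough aid for reading the active-vs-total figures quoted in the timeline above, the snippet below computes the fraction of weights touched per token for three of the entries; per-token compute tracks the active count, not the total. The parameter figures are taken from the timeline quotes themselves, but the ratio is a back-of-the-envelope simplification rather than vendor-published math.

```python
# Back-of-the-envelope: fraction of parameters active per token, using the
# active/total figures quoted in the timeline entries above.
models = {
    "Llama 4 Scout":    {"active_b": 17, "total_b": 109},
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400},
    "Kimi K2.6":        {"active_b": 32, "total_b": 1000},
}
for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B weights per token "
          f"({frac:.1%} of total)")
```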