gentic.news — AI News Intelligence Platform

Technique · architecture

Mixture of Experts (Sparse MoE for LLMs)

An architecture where a router activates only a subset of expert sub-networks per token, scaling parameter count without proportional compute cost.

Origin: Google, 2017-01 (Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer") · Also known as: MoE, Sparse MoE, Sparsely-Gated MoE
Products deploying: 17
Avg research → prod: 9y
First commercial deploy: 8y
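
To make the routing idea above concrete, here is a minimal sketch of a sparsely-gated top-2 MoE feed-forward layer in PyTorch. The layer width, expert count, and gating details are illustrative assumptions, not the configuration of any product listed below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Toy sparsely-gated MoE layer: a router picks top_k experts per token."""
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)          # scores every expert per token
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                                    # x: (tokens, d_model)
            logits = self.router(x)                              # (tokens, n_experts)
            weights, idx = torch.topk(logits, self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
                if rows.numel() == 0:
                    continue
                # Only routed tokens pass through this expert, so per-token compute
                # scales with top_k, not with the total number of experts.
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
            return out

    tokens = torch.randn(16, 512)                                # 16 tokens of width 512
    print(SparseMoE()(tokens).shape)                             # torch.Size([16, 512])

Production systems also add load-balancing losses and expert capacity limits so tokens spread evenly across experts; the sketch above omits those for brevity.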

Deployment timeline

  1. Llama 4 Scout

    Deployed 2025-04-05 · Velocity 8y

    Meta's first natively multimodal open-weight MoE model with 17B active / 109B total params, 16 experts

    Confidence: high
  2. Llama 4 Maverick

    Deployed 2025-04-05 · Velocity 8y

    Meta's flagship open-weight multimodal MoE with 17B active / 400B total params, 128 experts

    Confidence: high
  3. GPT-4o

    Deployed 2026-02-16 · Velocity 9y

    GPT-4 is widely reported to be an MoE model. GPT-4o is its successor, implying architectural continuity.

    Confidence: medium
  4. GPT-5

    Deployed 2026-02-16 · Velocity 9y

    GPT-5 is widely reported to be a Mixture of Experts (MoE) model, scaling parameters efficiently.

    Confidence: high
  5. Gemini 3 Pro

    Deployed 2026-02-19 · Velocity 9y

    Gemini 1.5 Pro is a Mixture-of-Experts (MoE) model with 8 experts and 2 active per token.

    Confidence: high
  6. Gemini 3.1

    Deployed 2026-02-20 · Velocity 9y

    Gemini 3.1 is distinguished by its 'Mixture of Experts' (MoE) architecture.

    Confidence: high
  7. GPT-5.3

    Deployed 2026-02-26 · Velocity 9y

    OpenAI has explored MoE architectures (e.g., GPT-4); GPT-5.3 likely uses sparse MoE for efficient scaling.

    Confidence: medium
  8. Gemini 3 Flash

    Deployed 2026-02-27 · Velocity 9y

    Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. While Flash is a dense model, the overall Gemini family deploys MoE.

    Confidence: medium
  9. Kimi K2.5

    Deployed 2026-03-04 · Velocity 9y

    The 1 trillion parameter count strongly suggests a Mixture of Experts architecture to manage computational costs.

    Confidence: high
  10. Gemini 3.1 Flash-Lite

    Deployed 2026-03-05 · Velocity 9y

    Gemini 1.5 Pro uses a Mixture-of-Experts (MoE) architecture. Flash-Lite is a distilled version of the larger MoE models.

    Confidence: high
  11. DeepSeek-V3

    Deployed 2026-03-11 · Velocity 9y

    DeepSeek-V3 is a highly efficient mixture-of-experts language model.

    Confidence: high
  12. Nemotron 3 Super

    Deployed 2026-03-11 · Velocity 9y

    Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.

    Confidence: high
  13. Mistral Small 4

    Deployed 2026-03-16 · Velocity 9y

    Mistral Small 4 is a 119B-parameter Mixture of Experts model.

    Confidence: high
  14. DeepSeek-R1

    Deployed 2026-03-17 · Velocity 9y

    The 671B-parameter model uses a sparse mixture-of-experts architecture.

    Confidence: high
  15. Qwen 3.6

    Deployed 2026-03-31 · Velocity 9y

    Qwen 3.6 includes a MoE version (Qwen 3.6 MoE) with 14B active parameters.

    Confidence: high
  16. GPT-5.4-Cyber

    Deployed 2026-04-16 · Velocity 9y

    GPT-4 is reported to use a Mixture of Experts architecture.

    Confidence: high
  17. Kimi K2.6

    Deployed 2026-04-20 · Velocity 9y

    Moonshot AI's 1T-parameter model (32B active) is explicitly a mixture-of-experts architecture.

    Confidence: high
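
The headline claim above (more total parameters without proportionally more compute) can be sanity-checked against the timeline entries that quote both active and total parameter counts. A quick calculation using only the figures listed on this page:

    # Active-parameter fraction for timeline entries quoting both figures.
    # Per-token compute scales roughly with active parameters, not total.
    models = {
        "Llama 4 Scout":    (17e9, 109e9),
        "Llama 4 Maverick": (17e9, 400e9),
        "Kimi K2.6":        (32e9, 1000e9),
    }
    for name, (active, total) in models.items():
        print(f"{name}: {active / total:.1%} of parameters active per token")
    # Llama 4 Scout: 15.6% of parameters active per token
    # Llama 4 Maverick: 4.2% of parameters active per token
    # Kimi K2.6: 3.2% of parameters active per token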

Techniques built on this