gentic.news — AI News Intelligence Platform

Technique · architecture

Grouped-Query Attention (GQA)

An architectural optimization for inference efficiency that partitions query heads into groups, each group sharing one key/value head. It sits between multi-head attention (one KV head per query head) and multi-query attention (a single KV head for all query heads), shrinking the KV cache at minimal quality loss.
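The grouping can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual kernel; head counts and dimensions are made up:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: several query heads share one KV head.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_kv_heads divides n_q_heads.
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        g = h // group_size               # KV head shared by this head's group
        scores = q[h] @ k[g].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[g]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # 2 KV heads -> groups of 4
v = rng.standard_normal((2, 4, 16))
print(gqa_attention(q, k, v).shape)   # (8, 4, 16)
```

With `n_kv_heads == n_q_heads` this degenerates to standard multi-head attention, and with a single KV head it is multi-query attention; the KV cache only ever stores the 2 KV heads here, a 4x saving over caching all 8.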

Origin: Google, 2023-05 · Read origin paper →
Also known as: GQA, Multi-Query Attention variant
Products deploying: 10
Avg research → prod: 3y
First commercial deploy: 3y

Deployment timeline

  1. GPT-5

    Deployed 2026-02-16 · Velocity 3y

    GQA is a standard inference optimization for large-scale models to reduce memory overhead.

    Confidence: medium
  2. Claude 3

    Deployed 2026-02-18 · Velocity 3y

    Claude 3 models use GQA to improve inference efficiency, as stated in the system card.

    Confidence: high
  3. Gemini 3 Pro

    Deployed 2026-02-19 · Velocity 3y

    Gemini 1.5 uses grouped-query attention (GQA) for efficient inference.

    Confidence: high
  4. GPT-5.3

    Deployed 2026-02-26 · Velocity 3y

    GQA is widely adopted in state-of-the-art LLMs for inference efficiency; GPT-5.3 likely incorporates similar optimizations.

    Confidence: medium
  5. Gemini 3 Flash

    Deployed 2026-02-27 · Velocity 3y

    Gemini 1.5 models use grouped-query attention (GQA) for efficient inference, as detailed in the technical report.

    Confidence: high
  6. Kimi K2.5

    Deployed 2026-03-04 · Velocity 3y

    As a large-scale model, Kimi K2.5 likely uses GQA to manage KV cache memory efficiently for its 1T parameters.

    Confidence: medium
  7. DeepSeek-V3

    Deployed 2026-03-11 · Velocity 3y

    DeepSeek-V3 uses Multi-head Latent Attention (MLA), a related KV-cache compression technique that goes further than standard GQA.

    Confidence: high
  8. Mistral Small 4

    Deployed 2026-03-16 · Velocity 3y

    Mistral models use Grouped-Query Attention (GQA).

    Confidence: high
  9. GLM-5.1

    Deployed 2026-03-21 · Velocity 3y

    GLM-5.1 architecture uses Grouped-Query Attention (GQA) to reduce KV cache memory.

    Confidence: high
  10. Qwen 3.6

    Deployed 2026-03-31 · Velocity 3y

    Qwen 3.6 uses GQA to reduce memory usage and improve inference speed.

    Confidence: high
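The KV-cache saving that motivates the deployments above is simple arithmetic: the cache stores a key and a value vector per layer, per KV head, per token. The configuration below is hypothetical (an illustrative 70B-class model, not any product listed here):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2: the cache holds both K and V for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical config: 80 layers, 64 query heads, head_dim 128, 8K context, fp16.
mha = kv_cache_bytes(80, 64, 128, 8192)  # MHA: one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, 8192)   # GQA: 8 KV heads (groups of 8)
print(f"{mha / 2**30:.1f} GiB vs {gqa / 2**30:.2f} GiB")  # 20.0 GiB vs 2.50 GiB
```

Cutting 64 KV heads to 8 divides the cache by 8, which is what lets servers batch more concurrent requests per GPU.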