Technique · architecture
Grouped-Query Attention (GQA)
An attention-architecture variant in which groups of query heads share a single key/value head, shrinking the KV cache at inference time with minimal quality loss.
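A minimal sketch of the grouping mechanism, using NumPy and hypothetical head counts (8 query heads sharing 2 key/value heads): each KV head is repeated across its group of query heads, so the cache only ever stores the smaller KV tensors.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch (illustrative, not any model's exact code).

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_q_heads is an integer multiple of n_kv_heads.
    """
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads  # query heads per shared KV head

    # Broadcast each KV head to its group of query heads. Only the small
    # (n_kv_heads, seq, d) tensors need to live in the KV cache.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)

    # Standard scaled dot-product attention per head.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # 2 shared KV heads -> 4x smaller KV cache
v = rng.normal(size=(2, 4, 16))
out = gqa_attention(q, k, v)      # shape (8, 4, 16)
```

With 8 query heads and 2 KV heads, the cached K/V tensors are 4x smaller than in standard multi-head attention; setting the KV head count to 1 recovers multi-query attention, and setting it equal to the query head count recovers ordinary MHA.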
Deployment timeline
- GPT-5 · medium
Deployed 2026-02-16 · Velocity 3y
“GQA is a standard inference optimization for large-scale models to reduce memory overhead.”
- Claude 3 · high
Deployed 2026-02-18 · Velocity 3y
“Claude 3 models use GQA to improve inference efficiency, as stated in the system card.”
- Gemini 3 Pro · high
Deployed 2026-02-19 · Velocity 3y
“Gemini 1.5 uses grouped-query attention (GQA) for efficient inference.”
- GPT-5.3 · medium
Deployed 2026-02-26 · Velocity 3y
“GQA is widely adopted in state-of-the-art LLMs for inference efficiency; GPT-5.3 likely incorporates similar optimizations.”
- Gemini 3 Flash · high
Deployed 2026-02-27 · Velocity 3y
“Gemini 1.5 models use grouped-query attention (GQA) for efficient inference, as detailed in the technical report.”
- Kimi K2.5 · medium
Deployed 2026-03-04 · Velocity 3y
“As a large-scale model, Kimi K2.5 likely uses GQA to manage KV cache memory efficiently for its 1T parameters.”
- GLM-5.1 · high
Deployed 2026-03-21 · Velocity 3y
“GLM-5.1 architecture uses Grouped-Query Attention (GQA) to reduce KV cache memory.”
- Qwen 3.6 · high
Deployed 2026-03-31 · Velocity 3y
“Qwen 3.6 uses GQA to reduce memory usage and improve inference speed.”