Technique · architecture
Grouped-Query Attention (GQA)
An attention-architecture variant in which groups of query heads share a single key/value head, shrinking the KV cache at inference time with minimal quality loss.
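A minimal sketch of the grouping mechanism, using NumPy and hypothetical head counts (8 query heads sharing 2 key/value heads): each KV head is repeated across its group of query heads, so the cache only ever stores the smaller KV tensors.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch (illustrative, not any model's exact code).

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_q_heads is an integer multiple of n_kv_heads.
    """
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads  # query heads per shared KV head

    # Broadcast each KV head to its group of query heads. Only the small
    # (n_kv_heads, seq, d) tensors need to live in the KV cache.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)

    # Standard scaled dot-product attention per head.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # 2 shared KV heads -> 4x smaller KV cache
v = rng.normal(size=(2, 4, 16))
out = gqa_attention(q, k, v)      # shape (8, 4, 16)
```

With 8 query heads and 2 KV heads, the cached K/V tensors are 4x smaller than in standard multi-head attention; setting the KV head count to 1 recovers multi-query attention, and setting it equal to the query head count recovers ordinary MHA.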
Deployment timeline
- GPT-5 · medium
Deployed 2026-02-16 · Velocity 3y
“GQA is a standard inference optimization for large-scale models to reduce memory overhead.”
- Claude 3 · high
Deployed 2026-02-18 · Velocity 3y
“Claude 3 models use GQA to improve inference efficiency, as stated in the system card.”
- Gemini 3 Pro · high
Deployed 2026-02-19 · Velocity 3y
“Gemini 1.5 uses grouped-query attention (GQA) for efficient inference.”
- GPT-5.3 · medium
Deployed 2026-02-26 · Velocity 3y
“GQA is widely adopted in state-of-the-art LLMs for inference efficiency; GPT-5.3 likely incorporates similar optimizations.”
- Gemini 3 Flash · high
Deployed 2026-02-27 · Velocity 3y
“Gemini 1.5 models use grouped-query attention (GQA) for efficient inference, as detailed in the technical report.”
- Kimi K2.5 · medium
Deployed 2026-03-04 · Velocity 3y
“As a large-scale model, Kimi K2.5 likely uses GQA to manage KV cache memory efficiently for its 1T parameters.”
- GLM-5.1 · high
Deployed 2026-03-21 · Velocity 3y
“GLM-5.1 architecture uses Grouped-Query Attention (GQA) to reduce KV cache memory.”
- Qwen 3.6 · high
Deployed 2026-03-31 · Velocity 3y
“Qwen 3.6 uses GQA to reduce memory usage and improve inference speed.”