A new paper (arXiv 2606.20945) proposes Grouped Query Experts (GQE), which speeds up long-context attention prefill by 1.7–1.8×. The method routes each token to only 9 of 16 query heads, with no measurable loss in average accuracy in the reported experiments.
Key facts
- 1.7–1.8× faster prefill on long contexts.
- 250M-parameter models trained on 30B tokens.
- Accuracy 56.04 vs baseline 55.86 (statistically flat).
- Uses 9 of 16 query heads per token.
- Built on top of grouped-query attention (GQA).
Standard multi-head attention forces every token to compute attention across all heads, even when some heads contribute little. Grouped Query Experts (GQE) addresses this by layering a mixture-of-experts router on top of grouped-query attention (GQA), the technique already used by models like Llama 2 70B and Mistral 7B to shrink the key-value (KV) cache.
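To make the KV-cache saving from GQA concrete, here is a minimal sketch of the usual cache-size arithmetic. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one cached tensor for keys, one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical shapes for illustration only (not from the paper).
LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 4096

# Multi-head attention: one KV head per query head (32 here).
mha = kv_cache_bytes(LAYERS, num_kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
# Grouped-query attention: 8 KV heads, each shared by 4 query heads.
gqa = kv_cache_bytes(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha / gqa:.0f}x")
# prints "MHA: 2.00 GiB, GQA: 0.50 GiB, ratio: 4x"
```

The saving scales with the ratio of query heads to KV heads, so the exact factor depends on how a given model groups its heads.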
How GQE works

GQE keeps the normal key and value cache from GQA but introduces query-head experts. A learned router assigns each token to a subset of query heads—typically 9 out of 16—while a single shared head always stays on to provide a stable learning signal. According to @rohanpaul_ai, this is "like giving the model many possible attention patterns, while making each token pay for only the small set that seems useful."
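The routing step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes top-k selection on router logits, softmax gate weights over the selected heads, that the shared head sits at index 0, and that the 9 active heads include the shared one (so 8 are routed). The function name `route_token` and the constants are hypothetical.

```python
import math
import random

NUM_QUERY_HEADS = 16  # from the article
ACTIVE_HEADS = 9      # from the article
SHARED_HEAD = 0       # assumption: index of the always-on shared head


def route_token(router_logits, num_active=ACTIVE_HEADS, shared_head=SHARED_HEAD):
    """Pick the active query heads for one token.

    Returns (sorted list of active head indices, dict of head -> gate weight).
    Assumes the active count includes the shared head.
    """
    routed = [h for h in range(len(router_logits)) if h != shared_head]
    k = num_active - 1  # slots left after reserving the shared head
    # Top-k routed heads by router logit.
    top = sorted(routed, key=lambda h: router_logits[h], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits.
    peak = max(router_logits[h] for h in top)
    exps = {h: math.exp(router_logits[h] - peak) for h in top}
    total = sum(exps.values())
    gates = {h: e / total for h, e in exps.items()}
    gates[shared_head] = 1.0  # shared head is always on and left ungated
    return sorted(gates), gates


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_QUERY_HEADS)]
active, gates = route_token(logits)
print(len(active), "of", NUM_QUERY_HEADS, "heads active:", active)
```

Only the heads in `active` would need their query projection and attention computed for this token, while keys and values stay shared as in GQA.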
Benchmark results
The authors trained 250M-parameter models on 30B tokens, comparing GQE against a standard GQA baseline. The best GQE configuration matched baseline average accuracy: 56.04 versus 55.86, while using only 9 of 16 query-attention computations. Prefill speedup reached 1.7–1.8× for long contexts. The paper notes that the router requires a strong learning signal—without it, quality degrades.
Why this matters for LLM inference
Long-context models such as Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4 Turbo (128k) make prefill expensive, because standard attention cost grows quadratically with context length. GQA already shrinks the KV cache by sharing each key-value head across a group of query heads; GQE additionally cuts per-token query-head computation from 16 heads to 9. In principle the technique is orthogonal to FlashAttention and speculative decoding, so operators could stack all three. The roughly 44% reduction in query-head compute (7 of 16 heads skipped) could translate to meaningful latency improvements in production serving, especially for long prompts where prefill attention dominates.
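The head-count arithmetic behind the 44% figure is easy to check. The sketch below also computes an idealized speedup, assuming (as a simplification, not a claim from the paper) that attention cost scales linearly with the number of active query heads and that routing adds no overhead.

```python
TOTAL_QUERY_HEADS = 16   # from the article
ACTIVE_QUERY_HEADS = 9   # from the article

active_fraction = ACTIVE_QUERY_HEADS / TOTAL_QUERY_HEADS
compute_reduction = 1 - active_fraction
# Idealized bound: assumes cost is proportional to active heads, no router overhead.
ideal_speedup = TOTAL_QUERY_HEADS / ACTIVE_QUERY_HEADS

print(f"active fraction: {active_fraction:.4f}")      # prints "active fraction: 0.5625"
print(f"compute reduction: {compute_reduction:.2%}")  # prints "compute reduction: 43.75%"
print(f"idealized speedup: {ideal_speedup:.2f}x")     # prints "idealized speedup: 1.78x"
```

Under that simplification the bound of about 1.78× sits in the same range as the reported 1.7–1.8× prefill speedup, though real speedups also depend on kernel efficiency and on the non-attention parts of the model.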
Limitations
The paper only tests 250M-parameter models. Scaling to 7B+ parameters—where attention cost is most painful—remains unproven. The router's training stability at scale is unclear. The authors did not release code or trained weights.
What to watch
Watch for a follow-up that scales GQE to 7B+ parameters, ideally with an open-weight release. Also track whether inference engines such as vLLM and TensorRT-LLM integrate the technique; production adoption would validate the paper's claims beyond the 250M regime.