MiniMax M3 Sparse Attention: 15.6x Decoding Speedup at 1M Tokens

MiniMax M3 sparse attention achieves 9.7x prefilling and 15.6x decoding speedup at 1M tokens, reversing M2's full-attention stance.

Ala SMITH & AI Research Desk · May 26, 2026 · 3 min read · AI-Generated

Source: x.com via @kimmonismus

TL;DR

9.7x prefilling speedup vs M2 · 15.6x decoding speedup at 1M tokens · Two-stage index + sparse attention

MiniMax teased M3's sparse attention architecture, showing 9.7x prefilling and 15.6x decoding speedup at 1M tokens versus M2. The two-stage approach uses an index branch for block selection before sparse attention on relevant KV blocks.

Key facts

9.7x prefilling speedup at 1M tokens vs M2
15.6x decoding speedup at 1M tokens vs M2
Two-stage: index branch + sparse KV attention
M2 used full attention after deeming efficient attention unready
Pretrain lead's March 2026 blog post justified M2's full attention

MiniMax's M3 sparse attention achieves 9.7x prefilling and 15.6x decoding speedup at 1M tokens versus M2, according to a tease from @kimmonismus. The architecture uses a novel two-stage approach: a lightweight index branch for block selection followed by sparse attention only on relevant KV blocks.
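
The two-stage flow described above can be sketched for a single decode step. This is a minimal illustration, not MiniMax's design, which has not been disclosed: the mean-key block summary stands in for the index branch, and `select_blocks`, `sparse_decode_step`, `block_size`, and `top_k` are all hypothetical names chosen for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Standard scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def select_blocks(query, keys, block_size, top_k):
    """Stage 1 (hypothetical index branch): score each KV block cheaply.

    Assumption: a block is summarized by the mean of its keys. A real
    index branch would presumably be a learned module.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start))
    # Keep the top_k highest-scoring blocks, restored to positional order.
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(start for _, start in best)

def sparse_decode_step(query, keys, values, block_size, top_k):
    """Stage 2: run ordinary attention over the selected blocks only."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    return attend(query, [keys[i] for i in idx], [values[i] for i in idx])
```

With `top_k` equal to the total number of blocks, the sketch reduces to full attention. The savings come from stage 2 touching only `top_k * block_size` keys and values, while stage 1 reads one summary per block.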

This marks a sharp reversal from MiniMax's M2 strategy. MiniMax deliberately reverted to full attention for M2 because efficient attention wasn't production-ready at the time. Their pretrain lead published a blog post in March 2026 justifying the full-attention choice. The M3 numbers suggest the engineering team now considers that production-readiness problem solved.

The benchmarks suggest the index branch overhead is small relative to the attention savings. At 1M tokens, the prefilling speedup is nearly 10x, so a context that took a couple of minutes to ingest under M2 would take on the order of ten seconds. The 15.6x decoding speedup at that length would cut per-token latency from ~150ms to ~10ms, if M2's full-attention baseline is around 150ms per token. That baseline is an illustrative assumption, not a reported figure.
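
The latency estimates above are simple division, sketched below. The speedup factors are the reported ones. The baseline values are illustrative assumptions, not measured M2 numbers, and `latency_after_speedup` is a hypothetical helper name.

```python
def latency_after_speedup(baseline, speedup):
    """Latency after applying a speedup factor to a baseline latency."""
    return baseline / speedup

PREFILL_SPEEDUP = 9.7   # reported, at 1M tokens vs M2
DECODE_SPEEDUP = 15.6   # reported, at 1M tokens vs M2

# Assumed baselines, chosen only to illustrate the arithmetic.
ASSUMED_M2_DECODE_MS = 150.0
ASSUMED_M2_PREFILL_S = 120.0

decode_ms = latency_after_speedup(ASSUMED_M2_DECODE_MS, DECODE_SPEEDUP)
prefill_s = latency_after_speedup(ASSUMED_M2_PREFILL_S, PREFILL_SPEEDUP)

print(f"decode:  {decode_ms:.1f} ms per token")  # decode:  9.6 ms per token
print(f"prefill: {prefill_s:.1f} s")             # prefill: 12.4 s
```

Different baselines scale the results linearly, so the ratios hold even if the absolute numbers turn out to be different.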

MiniMax has not disclosed the exact architecture details, training cost, or release timeline for M3. The company also did not specify whether the sparse attention is compatible with existing M2 checkpoints or requires retraining.

Unique take: MiniMax's M3 sparse attention is, on these numbers, the first production-grade efficient attention mechanism from a major open-weight lab that is faster than full attention on both prefilling and decoding at extreme lengths. This contrasts with Grouped-Query Attention (GQA) and Multi-Query Attention (MQA), both from Google researchers, which trade some quality for speed by sharing KV heads, and with Mamba-style state-space models that change the architecture entirely. M3 keeps the transformer architecture while making attention cost grow far more slowly with context length.

What to watch

Watch for MiniMax's official M3 release and benchmarks on standard long-context tasks like RULER or LongBench. If the speedups hold at 4M+ tokens without quality degradation, this becomes the strongest open-weight efficient attention design to date.

[Updated 01 Jun via pandaily]

MiniMax has officially released the M3 model, describing it as the first domestic AI model to combine frontier coding, agentic capabilities, 1M-token context windows, and native multimodal processing in a single architecture [per Pandaily]. The release confirms M3 is not merely a research teaser but a production-ready flagship, expanding the sparse attention story beyond text to multimodal inputs.

Sources cited in this article

Pandaily

Source: gentic.news · May 26, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

MiniMax's M3 sparse attention represents a significant structural shift in the efficient-attention landscape. Prior work such as Grouped-Query Attention (GQA, Google, 2023) and Multi-Query Attention (MQA, Shazeer, 2019) traded some quality for throughput by reducing KV head count. Mamba-style state-space models (Gu & Dao 2023) abandoned the transformer architecture entirely. MiniMax's approach keeps the full transformer, adding a lightweight index branch that learns to select relevant KV blocks before sparse attention. This is architecturally closest to the 'retrieval-augmented attention' line of work (e.g., Memorizing Transformers, 2022), but MiniMax claims production-readiness where those remained research prototypes.

The 15.6x decoding speedup at 1M tokens is particularly notable because decoding is memory-bandwidth-bound in most transformer implementations. A 15.6x improvement implies either dramatic KV cache compression or very high sparsity: if decode time scales with the KV data read, the index branch must select fewer than 7% of KV blocks at 1M tokens (1/15.6 ≈ 6.4%). If this holds at longer contexts (4M+), it would make MiniMax M3 competitive with Infini-Attention (2024) for long-document applications.

The contrarian take: MiniMax's March 2026 blog post arguing full attention was necessary for production quality now looks like a strategic hedge. The M2 full-attention model gave them time to perfect the index branch while competitors rushed half-baked sparse attention to market. If M3 delivers on these benchmarks, MiniMax will have leapfrogged both the full-attention camp and the half-sparse camp simultaneously.
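
The "fewer than 7% of KV blocks" bound in the analysis above follows from a simple cost model, sketched here. It assumes a decode step's time is proportional to the KV data it reads, plus an index-branch cost expressed as a fraction of the full-attention step. Both the model and the `index_overhead` parameter are assumptions made for illustration, not disclosed M3 details.

```python
def max_kv_fraction(speedup, index_overhead=0.0):
    """Upper bound on the fraction of KV blocks read per decode step.

    Assumed model: sparse_time / full_time = kv_fraction + index_overhead,
    and that ratio must equal 1 / speedup.
    """
    fraction = 1.0 / speedup - index_overhead
    if fraction <= 0:
        raise ValueError("index overhead alone exceeds the time budget")
    return fraction

# No index cost at all: the loosest possible bound.
print(f"{max_kv_fraction(15.6):.1%}")                       # 6.4%

# If the index branch costs 1% of a full-attention step (hypothetical).
print(f"{max_kv_fraction(15.6, index_overhead=0.01):.1%}")  # 5.4%
```

Any index cost tightens the bound, so under this model the selected fraction can only be lower than 6.4%, never higher.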
