gentic.news — AI News Intelligence Platform

[Figure: diagram of MiniMax M3 sparse attention architecture showing index branch selecting blocks before sparse attention…]
AI Research · Score: 92

MiniMax M3 Sparse Attention: 15.6x Decoding Speedup at 1M Tokens

MiniMax M3 sparse attention achieves 9.7x prefilling and 15.6x decoding speedup at 1M tokens, reversing M2's full-attention stance.

TL;DR

9.7x prefilling speedup vs M2 · 15.6x decoding speedup at 1M tokens · Two-stage index + sparse attention

MiniMax teased M3's sparse attention architecture, showing 9.7x prefilling and 15.6x decoding speedup at 1M tokens versus M2. The two-stage approach uses an index branch for block selection before sparse attention on relevant KV blocks.

Key facts

  • 9.7x prefilling speedup at 1M tokens vs M2
  • 15.6x decoding speedup at 1M tokens vs M2
  • Two-stage: index branch + sparse KV attention
  • M2 used full attention after deeming efficient attention unready
  • Pretrain lead's March 2026 blog post justified M2's full attention

MiniMax's M3 sparse attention achieves a 9.7x prefilling speedup and a 15.6x decoding speedup at 1M tokens versus M2, according to a teaser shared by @kimmonismus. The architecture works in two stages: a lightweight index branch first selects KV blocks, then sparse attention runs only over the selected blocks.
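MiniMax has not published the design, so the following is only a minimal sketch of the general two-stage idea for a single decode step, not M3's actual method. The function name `block_sparse_decode`, the mean-key block summary used as the "index branch", and the `block_size` and `top_k` parameters are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def block_sparse_decode(q, K, V, block_size, top_k):
    """One decode step: score blocks cheaply, then attend only to top_k blocks.

    Hypothetical illustration; assumes len(K) is a multiple of block_size.
    """
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K.reshape(n_blocks, block_size, d)
    Vb = V.reshape(n_blocks, block_size, d)

    # Stage 1 (stand-in for an index branch): one summary vector per block,
    # here simply the mean key, scored against the query.
    block_summaries = Kb.mean(axis=1)            # (n_blocks, d)
    block_scores = block_summaries @ q           # (n_blocks,)
    selected = np.sort(np.argsort(block_scores)[-top_k:])

    # Stage 2: ordinary softmax attention restricted to the selected blocks.
    K_sel = Kb[selected].reshape(-1, d)
    V_sel = Vb[selected].reshape(-1, d)
    weights = softmax(K_sel @ q / np.sqrt(d))
    return weights @ V_sel, selected

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 8))
V = rng.normal(size=(64, 8))
q = rng.normal(size=8)
out, picked = block_sparse_decode(q, K, V, block_size=8, top_k=2)
print(picked, out.shape)
```

In this sketch, stage 1 touches one vector per block while stage 2 touches only `top_k * block_size` keys and values, which is where the savings would come from. Setting `top_k` to the number of blocks reproduces full attention exactly.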

This marks a sharp reversal from the M2 strategy. MiniMax deliberately returned to full attention for M2 because, in its view, efficient attention was not yet production-ready, and its pretraining lead published a blog post in March 2026 justifying that choice. The M3 numbers suggest the team now considers the production-readiness problem solved.

If the reported figures hold, the index branch adds little overhead relative to the attention it saves. At 1M tokens the prefilling speedup is nearly 10x, so a context that takes a few minutes to ingest under full attention would take seconds. The 15.6x decoding speedup would cut per-token latency from roughly 150 ms to roughly 10 ms (150 / 15.6 ≈ 9.6). The 150 ms baseline is an illustrative assumption for M2's full attention at that length, not a published figure.
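The latency arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: the 150 ms decode baseline and the 120 s prefill baseline are illustrative assumptions, and only the 15.6x and 9.7x speedups come from the report.

```python
def latency_after_speedup(baseline, speedup):
    """Latency after applying a multiplicative speedup (same unit as baseline)."""
    return baseline / speedup

DECODE_SPEEDUP = 15.6    # reported, at 1M tokens vs M2
PREFILL_SPEEDUP = 9.7    # reported, at 1M tokens vs M2

# Assumed baselines, chosen only to illustrate the scale of the change.
decode_ms = latency_after_speedup(150.0, DECODE_SPEEDUP)
prefill_s = latency_after_speedup(120.0, PREFILL_SPEEDUP)

print(f"decode: {decode_ms:.1f} ms per token")
print(f"prefill: {prefill_s:.1f} s")
```

Under these assumed baselines, decode lands just under 10 ms per token and a two-minute prefill drops to roughly twelve seconds.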

MiniMax has not disclosed the exact architecture details, training cost, or release timeline for M3. The company also did not specify whether the sparse attention is compatible with existing M2 checkpoints or requires retraining.

Unique take: If the numbers hold, MiniMax's M3 sparse attention would be among the first efficient attention mechanisms from a major open-weight lab to beat full attention on both prefilling and decoding at extreme lengths. Grouped-query attention (GQA) and multi-query attention (MQA), both of which originated at Google, gain speed by sharing key/value heads and can give up some quality in exchange. Mamba-style state-space models replace attention entirely. M3 keeps the transformer architecture and instead limits each query to a selected subset of KV blocks, which is what would let attention cost grow far more slowly than full attention as context length increases.

What to watch

Watch for MiniMax's official M3 release and benchmarks on standard long-context tasks like RULER or LongBench. If the speedups hold at 4M+ tokens without quality degradation, this becomes the strongest open-weight efficient attention design to date.

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

MiniMax's M3 sparse attention represents a significant structural shift in the efficient-attention landscape. Prior work such as grouped-query attention (GQA, Google, 2023) and multi-query attention (MQA, also from Google) traded quality for throughput by reducing the KV head count. Mamba-style state-space models (Gu & Dao 2023) abandoned the attention mechanism entirely. MiniMax's approach keeps the full transformer and adds a lightweight index branch that learns to select relevant KV blocks before sparse attention. This is architecturally closest to the 'retrieval-augmented attention' line of work (e.g., Memorizing Transformers, 2022), but MiniMax claims production-readiness where those remained research prototypes.

The 15.6x decoding speedup at 1M tokens is particularly notable because decoding is memory-bandwidth-bound in most transformer implementations. A 15.6x improvement implies either dramatic KV cache compression or very high sparsity: if decode time scales with the amount of KV cache read, the index branch must select no more than about 6.4% (1/15.6) of KV blocks at 1M tokens, and less once its own overhead is counted. If this holds at longer contexts (4M+), it would make MiniMax M3 competitive with Infini-Attention (2024) for long-document applications.

The contrarian take: MiniMax's March 2026 blog post arguing that full attention was necessary for production quality now looks like a strategic hedge. The M2 full-attention model gave the team time to refine the index branch while competitors rushed half-baked sparse attention to market. If M3 delivers on these benchmarks, MiniMax will have leapfrogged both the full-attention camp and the half-sparse camp simultaneously.
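The sparsity bound in the analysis above follows from a simple model: if decode time is proportional to the share of the KV cache that is read, a speedup of s allows reading at most 1/s of it, minus whatever the index branch itself costs. The sketch below works that out. The 128-token block size and the 1% index overhead are hypothetical values, not disclosed by MiniMax.

```python
import math

def max_selected_fraction(speedup, index_overhead=0.0):
    """Upper bound on the fraction of KV blocks read per decode step.

    Assumes decode time is proportional to KV data read, and that the index
    branch costs `index_overhead` as a fraction of full-attention time.
    """
    frac = 1.0 / speedup - index_overhead
    if frac <= 0:
        raise ValueError("index overhead alone exceeds the time budget")
    return frac

CONTEXT_TOKENS = 1_000_000
BLOCK_SIZE = 128                      # hypothetical block size
n_blocks = math.ceil(CONTEXT_TOKENS / BLOCK_SIZE)

frac_free = max_selected_fraction(15.6)           # index branch assumed free
frac_1pct = max_selected_fraction(15.6, 0.01)     # index branch costs 1%

max_blocks = int(frac_free * n_blocks)
print(f"{frac_free:.4f} of blocks, at most {max_blocks} of {n_blocks}")
print(f"{frac_1pct:.4f} of blocks with 1% index overhead")
```

Under this model, any cost the index branch adds comes directly out of the budget for selected blocks, so the real selection ratio would have to be below the 1/15.6 ceiling.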