MiniMax M3: Sparse Attention, 1M Context, Multimodal via Together

MiniMax M3 uses sparse attention for 1M context and multimodality, with Together AI serving fast inference.

Ala SMITH & AI Research Desk · Jun 3, 2026 · 3 min read · AI-Generated

Source: x.com via @MiniMax_AI · Multi-Source

What are the key features of MiniMax's M3 model?

MiniMax's M3 model features sparse attention, a 1M-token context window, and multimodal capabilities, with Together AI handling the serving infrastructure to deliver fast inference.

TL;DR

MiniMax M3 uses sparse attention for efficiency. · 1M context window enables long-document reasoning. · Together AI serves the model at production speed.

MiniMax M3 introduces sparse attention, a 1M context window, and multimodality. Together AI handled the serving infrastructure to deliver fast inference.

Key facts

Model: MiniMax M3 with sparse attention.
Context window: 1 million tokens.
Multimodal: text, images, audio support.
Serving partner: Together AI for inference.
No benchmark or parameter count disclosed.

MiniMax's M3 model, announced via a post by Skyler Miao on X @MiniMax_AI, combines three technical features: sparse attention, a 1M-token context window, and multimodal input processing. Standard full attention scores every token against every other token, so its cost grows with the square of the sequence length. Sparse attention lets each token attend to only a subset of positions, which allows the model to handle sequences of up to 1 million tokens without that quadratic growth in compute. This makes M3 a candidate for tasks like long-document summarization, codebase analysis, and retrieval-augmented generation over large corpora.
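To make the scaling argument concrete, here is a rough back-of-the-envelope sketch that counts how many query-key pairs get scored under dense causal attention versus a causal sliding window. The sliding window is only a stand-in for "some sparse pattern", since the announcement does not say which pattern M3 uses, and the window size of 4096 is a hypothetical value chosen for illustration, not a MiniMax figure.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs scored by dense causal attention.

    Token i (0-indexed) attends to itself and to all i earlier tokens.
    """
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Query-key pairs scored by causal sliding-window attention.

    Each token attends to itself and to at most window - 1 earlier tokens.
    """
    w = min(window, n)
    # The first w tokens see 1, 2, ..., w keys; every later token sees w keys.
    return w * (w + 1) // 2 + (n - w) * w


if __name__ == "__main__":
    window = 4096  # hypothetical window size, not taken from the announcement
    for n in (8_192, 131_072, 1_000_000):
        dense = dense_causal_pairs(n)
        sparse = sliding_window_pairs(n, window)
        print(f"n={n:>9,}  dense={dense:,}  windowed={sparse:,}  "
              f"ratio={dense / sparse:.1f}x")
```

The dense count grows quadratically with the sequence length while the windowed count grows linearly, so the gap widens as the context gets longer. Pair counts are only a proxy for cost: real speedups depend on kernels and memory access, which this sketch ignores.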

Together AI, a cloud provider specializing in AI inference, contributed the serving layer to make M3 fast at production scale. The partnership highlights a growing trend: model developers focus on architecture while infrastructure partners optimize deployment. The announcement does not describe Together's inference stack. It plausibly relies on custom kernels and batching that exploit the sparsity in M3's attention, with the aim of keeping latency competitive with dense models of similar size, though no latency figures were disclosed.

Sparse Attention and Context Scaling

Sparse attention has been explored in earlier models such as Longformer (Beltagy et al. 2020) and BigBird (Zaheer et al. 2020), but M3's implementation appears to target real-time inference at 1M tokens. The exact sparsity pattern (e.g., fixed stride, dilated sliding window, or learned) was not disclosed. At 1M tokens, M3's context window is roughly 5x to 8x larger than those of Claude 3 (200K tokens) and GPT-4 Turbo (128K tokens), though benchmark comparisons are absent from the announcement.
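For readers unfamiliar with the pattern names above, the following minimal sketch builds small causal attention masks for a sliding window, a dilated sliding window, and a fixed stride. These are simplified textbook-style definitions chosen for illustration; they are not M3's pattern, which was not disclosed, and a learned pattern cannot be shown this way. All function names and parameters are hypothetical.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def sliding_window_mask(n: int, window: int) -> Mask:
    """Query i attends to itself and the window - 1 keys just before it."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def dilated_window_mask(n: int, window: int, dilation: int) -> Mask:
    """Query i attends to up to `window` keys spaced `dilation` apart,
    at offsets 0, dilation, 2 * dilation, ... behind it."""
    return [
        [0 <= i - j < window * dilation and (i - j) % dilation == 0
         for j in range(n)]
        for i in range(n)
    ]


def fixed_stride_mask(n: int, stride: int) -> Mask:
    """Query i attends to itself and every earlier key whose distance
    from i is a multiple of `stride`."""
    return [[j <= i and (i - j) % stride == 0 for j in range(n)]
            for i in range(n)]


def render(mask: Mask) -> str:
    """Draw the mask with 'x' for allowed pairs and '.' for masked pairs."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(sliding_window_mask(8, 3)), end="\n\n")
    print(render(dilated_window_mask(8, 3, 2)), end="\n\n")
    print(render(fixed_stride_mask(8, 3)))
```

Each row is one query token. The sliding window keeps a dense band near the diagonal, the dilated variant reaches further back with the same number of keys per row, and the fixed stride keeps a thin set of evenly spaced long-range links.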

Multimodal Capabilities

M3 supports multiple modalities including text, images, and audio, per the source. This aligns with the industry shift toward unified models, as seen with Gemini 1.5 Pro and GPT-4V. No specific benchmarks or performance numbers were provided for multimodal tasks.

Serving Infrastructure

Together AI's role is critical: without optimized inference, sparse attention can run slower than dense attention in practice, because its irregular memory access patterns make poor use of GPU hardware. The announcement does not say how Together addressed this. Fused kernels and speculative decoding are plausible ways to hide the overhead, but that is speculation. The result is a model that, according to the announcement, runs "fast" at scale.

Missing Details

The source does not disclose model size (parameter count), training data, open-source availability, pricing, or benchmark results. These omissions make it difficult to assess M3's practical value relative to existing models. The announcement is promotional rather than technical, leaving many questions unanswered.

What to watch

Watch for benchmark results (e.g., RULER, Needle-in-a-Haystack, MMLU) and open-source availability. If M3 is released under a permissive license, it could challenge Llama 3.1 and Mistral in long-context tasks. Also monitor Together AI's inference pricing and latency benchmarks.
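As a rough illustration of what a Needle-in-a-Haystack style check involves, the sketch below hides one fact at a chosen depth inside filler text and scores an answer by substring match. It is a simplified stand-in for real benchmark harnesses, it makes no model call, and every name in it (`build_haystack`, `needle_recalled`, the filler and needle strings) is hypothetical.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Return n_sentences copies of `filler` with `needle` inserted at `depth`.

    depth=0.0 places the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def needle_recalled(answer: str, expected: str) -> bool:
    """Crude scoring: did the model's answer contain the expected string?"""
    return expected.lower() in answer.lower()


if __name__ == "__main__":
    context = build_haystack(
        filler="The sky is blue.",
        needle="The passcode is 7481.",
        n_sentences=1000,
        depth=0.5,
    )
    prompt = context + "\n\nQuestion: What is the passcode?"
    # A real test would send `prompt` to the model under evaluation and
    # repeat over many context lengths and depths; here the answer is faked.
    fake_answer = "The passcode is 7481."
    print(len(prompt), needle_recalled(fake_answer, "7481"))
```

Sweeping the context length toward 1M tokens and the depth from 0.0 to 1.0 would show whether retrieval degrades in particular regions of the context, which is the kind of evidence the announcement does not yet provide.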

Source: gentic.news · Jun 3, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

M3's combination of sparse attention and a 1M context is technically notable but not novel: Longformer and BigBird established the approach years ago. What's new is the partnership with Together AI, which signals a maturing ecosystem where model developers outsource serving. The lack of benchmarks or parameter counts suggests this is a pre-release preview rather than a production-ready model. The real test will be whether M3 can maintain coherence across 1M tokens, a challenge that has tripped up many long-context models. Together AI's inference optimizations may give M3 a speed advantage, but without open-source weights or API pricing, it's hard to gauge impact. The announcement's brevity contrasts with the hype around context scaling, which has become a commodity feature rather than a differentiator.
