sparse attention
30 articles about sparse attention in AI news
MiniMax M3: Sparse Attention, 1M Context, Multimodal via Together
MiniMax M3 uses sparse attention for 1M context and multimodality, with Together AI serving fast inference.
MiniMax M3 Sparse Attention: 15.6x Decoding Speedup at 1M Tokens
MiniMax M3 sparse attention achieves 9.7x prefilling and 15.6x decoding speedup at 1M tokens, reversing M2's full-attention stance.
DeepSeek's HISA: Hierarchical Sparse Attention Cuts 64K Context Indexing Cost
DeepSeek researchers introduced HISA, a hierarchical sparse attention method that replaces flat token scanning. It removes a computational bottleneck at 64K context lengths without requiring any model retraining.
Memory Sparse Attention (MSA) Achieves 100M Token Context with Near-Linear Complexity
A new attention architecture, Memory Sparse Attention (MSA), breaks the 100M token context barrier while maintaining 94% accuracy at 1M tokens. It uses document-wise RoPE and end-to-end sparse attention to outperform RAG systems and frontier models.
Memory Sparse Attention (MSA) Enables 100M Token Context Windows with Minimal Performance Loss
Memory Sparse Attention (MSA) is a proposed architecture that allows AI models to store and reason over massive long-term memory directly within their attention mechanism, eliminating the need for external retrieval systems. The approach reportedly enables context windows of up to 100 million tokens with minimal performance degradation.
VSPrefill: The Vertical-Slash Breakthrough That Makes 128K Contexts Practical
Researchers have developed VSPrefill, a novel sparse attention mechanism that dramatically accelerates long-context processing in LLMs. Using lightweight indexing of vertical columns and slash diagonals, it achieves 4.95x speedup while maintaining 98.35% accuracy at 128k context lengths.
RCLRec: Reverse Curriculum Learning Targets Sparse Conversion Problem in Generative Recommendation
Researchers propose RCLRec, a reverse curriculum learning framework for generative recommendation that specifically addresses sparse conversion signals. By constructing short, conversion-focused curricula from user history, it provides targeted supervision, boosting online ad revenue by +2.09% and orders by +1.86%.
Sparse Sensors, Rich Views: How Minimal Radar Data Supercharges AI Scene Generation
Researchers have developed a novel approach that combines single images with extremely sparse radar or LiDAR data to dramatically improve AI's ability to generate realistic 3D views from 2D photos. This multimodal technique overcomes fundamental limitations of vision-only systems in challenging conditions like bad weather and low texture.
Multi-Level Graph Contrastive Learning Beats SOTA on KG Recommendations
A multi-level graph attention network with contrastive learning outperforms SOTA on KG recommendations by handling sparse labels and noisy entities.
Alibaba's Qwen3.5: The Efficiency Breakthrough That Could Democratize Multimodal AI
Alibaba has open-sourced Qwen3.5, a multimodal AI model that combines linear attention with sparse Mixture of Experts architecture to deliver high performance without exorbitant computational costs, potentially making advanced AI more accessible.
DeepSeek V4-Pro: 1.6T parameters, open weights, undercuts rivals 10x
DeepSeek unveiled V4-Pro and V4-Flash, its largest open-weight models with up to 1.6 trillion parameters and a 1M-token context window. The new hybrid attention architecture cuts compute for long contexts by 73–90%, enabling prices far below OpenAI, Google, and Anthropic.
FalkorDB: Graph Database for Multi-Hop AI Queries in Milliseconds
FalkorDB, an open-source graph database, stores connections as a sparse matrix to accelerate multi-hop queries by 100x. Combined with built-in vector search, it enables GraphRAG systems that answer complex relational questions without pre-built articles.
UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems
A new arXiv paper introduces UniMixer, a unified scaling architecture for recommender systems. It bridges attention-based, TokenMixer-based, and factorization-machine-based methods into a single theoretical framework, aiming to improve parameter efficiency and scaling return on investment (ROI).
HyenaRec: A Polynomial-Based Architecture for Fast, Scalable Sequential Recommendation
Researchers propose HyenaRec, a novel sequential recommender using Legendre polynomial kernels and gated convolutions. It achieves better accuracy than attention-based models while training up to 6x faster, especially on long user histories. This addresses a critical efficiency bottleneck in next-item prediction.
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
Step-3.5-Flash: 196B Open-Source MoE Model Activates Only 11B Parameters, Outperforms Kimi K2.5 and Claude Opus 4.5 on Key Benchmarks
Shanghai-based StepFun's Step-3.5-Flash, a 196B parameter sparse mixture-of-experts model that activates only 11B parameters per token, achieves top scores on AIME 2025 (97.3) and LiveCodeBench-V6 (86.4) while costing 18.9x less to run than Kimi K2.5.
ReFORM: A New LLM Framework for Multi-Factor Recommendation from User Reviews
Researchers propose ReFORM, a novel recommendation framework that uses LLMs to generate factor-specific user and item profiles from reviews, then applies multi-factor attention to personalize suggestions. It outperforms state-of-the-art baselines on restaurant datasets, offering a more nuanced approach to personalization.
STAR-Set Transformer: AI Finally Makes Sense of Messy Medical Data
Researchers have developed a new transformer architecture that handles irregular, asynchronous medical time series by incorporating temporal and variable-type attention biases, outperforming existing methods on ICU prediction tasks while providing interpretable insights.
Amazon's T-REX: A Transformer Architecture for Next-Basket Grocery Recommendations
Amazon researchers propose T-REX, a transformer-based model for grocery basket recommendations. It addresses unique challenges like repetitive purchases and sparse patterns through category-level modeling and causal masking, showing significant improvements in offline/online tests.
The Laptop Agent Revolution: How 24B-Parameter Models Are Redefining On-Device AI
Liquid's LFM2-24B-A2B model runs locally on laptops, selecting tools in under 400ms. Its hybrid architecture enables sparse activation, making powerful AI agents practical for regulated industries and developers without cloud dependencies.
Sakana AI's Doc-to-LoRA: A Hypernetwork Breakthrough for Efficient Long-Context Processing
Sakana AI introduces Doc-to-LoRA, a lightweight hypernetwork that meta-learns to compress long documents into efficient LoRA adapters, dramatically reducing the computational costs of processing lengthy text. This innovation addresses the quadratic attention bottleneck that makes long-context AI models expensive and slow.
ByteDance Builds In-House AI CPUs for TikTok-Scale Agent Inference
ByteDance is building custom AI CPUs for inference at TikTok scale, targeting scarce server supply. The move signals a shift in agent workloads from training to inference hardware.
Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics
SemiAnalysis found that the median coding agent uses 96k input tokens, based on 432k requests, shifting the focus of inference cost from output to context.
How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute
LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.
CPU Demand Flipping the AI Narrative as Datacenter Growth Shifts
A new analysis from SemiAnalysis indicates CPU demand is rising in AI datacenters, reversing a narrative of GPU-only dominance. This shift signals changing workload patterns and infrastructure priorities.
New MoE Framework Tames User Interest Shifts in Long-Sequence Recommendations
Researchers propose MoS, a model-agnostic MoE approach that handles long user sequences by detecting session hopping – where user interests shift across sessions. The theme-aware routing mechanism filters irrelevant sessions, while multi-scale fusion captures global and local patterns. Results show SOTA performance on benchmarks with fewer FLOPs than alternatives.
Sam Altman: AI inference costs dropped 1000x from o1 to GPT-5.4
Sam Altman stated that AI inference costs for solving a fixed hard problem dropped ~1000x from o1 to GPT-5.4 in ~16 months, crediting cross-layer engineering optimizations rather than a single breakthrough.
Google Cloud Next '26: 8th-gen TPUs, agent platform, $750M fund
At Cloud Next 2026, Google unveiled two 8th-gen TPU chips, a Gemini-based enterprise AI agent platform, and a $750 million partner fund to drive secure, large-scale automation and heavy AI workloads.
Google's Memory Caching Bridges RNN-Transformer Gap with O(NL) Complexity
Google's 'Memory Caching' method saves RNN memory states at segment boundaries, allowing tokens to reference past checkpoints. This O(NL) approach significantly improves RNN performance on recall tasks, narrowing the gap with Transformers.
SID-Coord: A New Framework for Balancing Memorization and Generalization
A new arXiv paper introduces SID-Coord, a framework that integrates trainable Semantic IDs (SIDs) with traditional Hashed IDs (HIDs) in ranking models. It aims to solve the memorization-generalization trade-off, improving performance on long-tail items. Online A/B tests in a production short-video search system showed statistically significant improvements in engagement metrics.