attention
30 articles about attention in AI news
Apple's 'Attention to Mamba' Paper Proposes Cross-Architecture Transfer
Apple researchers introduced a two-stage recipe for transferring capabilities from Transformer models to Mamba-based architectures. This could enable efficient models that retain the performance of larger, attention-based predecessors.
DeepSeek's HISA: Hierarchical Sparse Attention Cuts 64K Context Indexing Cost
DeepSeek researchers introduced HISA, a hierarchical sparse attention method that replaces flat token scanning. It removes a computational bottleneck at 64K context lengths without requiring any model retraining.
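For intuition, here is a minimal numpy sketch of the general two-level pattern the summary describes: score the query against coarse block summaries first, then run exact attention only inside the winning blocks. The block size, mean-pooling, and selection rule are illustrative stand-ins, not HISA's actual indexing scheme.

```python
import numpy as np

def hierarchical_sparse_attention(q, K, V, block=64, top_blocks=4):
    n, d = K.shape
    n_blocks = n // block
    # Stage 1: score the query against mean-pooled block summaries.
    K_blocks = K[: n_blocks * block].reshape(n_blocks, block, d)
    top = np.argsort(K_blocks.mean(axis=1) @ q)[-top_blocks:]
    # Stage 2: exact softmax attention over tokens in the chosen blocks only.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]

d = 128
q = np.random.randn(d)
K, V = np.random.randn(65536, d), np.random.randn(65536, d)  # 64K-token cache
out = hierarchical_sparse_attention(q, K, V)
# The coarse pass touches n/block summaries instead of all n tokens.
```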
SteerViT Enables Natural Language Control of Vision Transformer Attention Maps
Researchers introduced SteerViT, a method that modifies Vision Transformers to accept natural language instructions, enabling users to steer the model's visual attention toward specific objects or concepts while maintaining representation quality.
ASI-Evolve Automates AI Research Loop, Discovers 105 Better Linear Attention Designs and Boosts AMC32 Scores by 12.5 Points
Researchers developed ASI-Evolve, an AI system that automates the experimental loop of AI research. It discovered 105 improved linear attention variants and boosted AMC32 scores by 12.5 points, evidence that automated systems can accelerate AI research itself.
HIVE Framework Introduces Hierarchical Cross-Attention for Vision-Language Pre-Training, Outperforms Self-Attention on MME and GQA
A new paper introduces HIVE, a hierarchical pre-training framework that connects vision encoders to LLMs via cross-attention across multiple layers. It outperforms conventional self-attention methods on benchmarks like MME and GQA, improving vision-language alignment.
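The connective tissue here is plain cross-attention: LLM hidden states act as queries, vision features as keys and values. A minimal PyTorch sketch of that pattern, with HIVE's hierarchical multi-layer wiring left out:

```python
import torch
import torch.nn as nn

class VisionCrossAttnBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_h, vision_feats):
        # Queries come from the LLM stream; keys/values from the vision encoder.
        ctx, _ = self.attn(query=text_h, key=vision_feats, value=vision_feats)
        return self.norm(text_h + ctx)  # residual keeps the language pathway intact

blocks = nn.ModuleList([VisionCrossAttnBlock() for _ in range(4)])
text_h = torch.randn(2, 32, 512)    # (batch, text tokens, dim)
vision = torch.randn(2, 196, 512)   # (batch, image patches, dim)
for blk in blocks:                  # in HIVE these interleave with LLM layers
    text_h = blk(text_h, vision)
```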
The Cognitive Divergence: AI Context Windows Expand as Human Attention Declines, Creating a Delegation Feedback Loop
A new arXiv paper documents the exponential growth of AI context windows (512 tokens in 2017 to 2M in 2026) alongside a measured decline in human sustained-attention capacity. It introduces the 'Delegation Feedback Loop' hypothesis: as delegating cognitive work to AI gets easier, humans practice sustained attention less, eroding the capacity further. The authors frame it as a foundational study of human-AI interaction dynamics.
Memory Sparse Attention (MSA) Achieves 100M Token Context with Near-Linear Complexity
A new attention architecture, Memory Sparse Attention (MSA), breaks the 100M token context barrier while maintaining 94% accuracy at 1M tokens. It uses document-wise RoPE and end-to-end sparse attention to outperform RAG systems and frontier models.
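'Document-wise RoPE' plausibly means restarting rotary position indices at each document boundary, so relative rotations never span documents; a numpy sketch under that assumption:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Standard RoPE: rotate feature pairs by position-dependent angles.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = pos[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Document-wise positions: restart the index at each document's start.
doc_lengths = [5, 3, 4]
positions = np.concatenate([np.arange(n) for n in doc_lengths])  # 0..4, 0..2, 0..3
x = np.random.randn(sum(doc_lengths), 64)
x_rot = rope_rotate(x, positions.astype(np.float64))
```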
Moonshot AI CEO Yang Zhilin Advocates for Attention Residuals in LLM Architecture
Yang Zhilin, founder of Moonshot AI, argues for the architectural value of attention residuals in large language models. This technical perspective comes from the creator of the popular Kimi Chat model.
Insanely Fast Whisper CLI Transcribes 2.5 Hours of Audio in 98 Seconds with Flash Attention 2
A new open-source CLI tool called Insanely Fast Whisper achieves a 19x speedup over standard Whisper large-v3, transcribing 150 minutes of audio in 98 seconds using Flash Attention 2 and batching, with no quality loss.
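The CLI is a thin wrapper around the Hugging Face Transformers ASR pipeline; the underlying recipe looks roughly like this (batch size and chunk length are tuning knobs, and flash-attn must be installed on a supported GPU):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)
# Batched 30-second chunks are what make long audio fast.
result = pipe("meeting.mp3", chunk_length_s=30, batch_size=24,
              return_timestamps=True)
print(result["text"])
```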
ColBERT-Att: New Research Enhances Neural Retrieval by Integrating Attention into Late Interaction
Researchers propose ColBERT-Att, a novel neural information retrieval model that integrates attention weights into the late-interaction framework. The method shows improved recall accuracy on standard benchmarks like MS-MARCO, BEIR, and LoTTE.
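Vanilla ColBERT scores a query-document pair by summing each query token's maximum similarity over document tokens; the natural reading of ColBERT-Att is that this uniform sum becomes an attention-weighted one. A numpy sketch under that reading (the weights below are stand-ins):

```python
import numpy as np

def maxsim_score(Q, D, q_weights=None):
    sims = Q @ D.T                 # (n_q, n_d) token-level similarities
    per_token = sims.max(axis=1)   # MaxSim: best document token per query token
    if q_weights is None:
        return per_token.sum()           # vanilla ColBERT late interaction
    return float(q_weights @ per_token)  # attention-weighted variant

Q = np.random.randn(8, 128)    # query token embeddings
D = np.random.randn(300, 128)  # document token embeddings
w = np.exp(np.random.randn(8)); w /= w.sum()  # stand-in attention weights
print(maxsim_score(Q, D), maxsim_score(Q, D, w))
```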
LLMs Show 'Privileged Access' to Own Policies in Introspect-Bench, Explaining Self-Knowledge via Attention Diffusion
Researchers formalize LLM introspection as computation over model parameters, showing that frontier models predict their own behavior better than peer models can. The study provides causal evidence that this introspective ability emerges via attention diffusion, without explicit training.
From Token to Item: New Research Proposes Item-Aware Attention to Enhance LLMs for Recommendation
Researchers propose an Item-Aware Attention Mechanism (IAM) that restructures how LLMs process product data for recommendations. It separates attention into intra-item (content) and inter-item (collaborative) layers to better model item-level relationships. This addresses a key limitation in current LLM-based recommenders.
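One way to realize the intra/inter split is with two complementary attention masks over the token-to-item assignment; a numpy sketch of that reading (the paper's exact masking may differ):

```python
import numpy as np

def intra_item_mask(item_ids):
    ids = np.asarray(item_ids)
    return ids[:, None] == ids[None, :]   # attend only within one's own item

item_ids = [0, 0, 0, 1, 1, 2, 2, 2]       # token -> item assignment
intra = intra_item_mask(item_ids)          # content (intra-item) layer
inter = ~intra                             # collaborative (inter-item) layer
logits = np.random.randn(8, 8)
intra_logits = np.where(intra, logits, -1e9)  # masked logits, ready for softmax
```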
Memory Sparse Attention (MSA) Enables 100M Token Context Windows with Minimal Performance Loss
Memory Sparse Attention (MSA) is a proposed architecture that allows AI models to store and reason over massive long-term memory directly within their attention mechanism, eliminating the need for external retrieval systems. The approach reportedly enables context windows of up to 100 million tokens with minimal performance degradation.
ByteDance Seed's Mixture-of-Depths Attention Reaches 97.3% of FlashAttention-2 Efficiency with 3.7% FLOPs Overhead
ByteDance Seed researchers introduced Mixture-of-Depths Attention (MoDA), an attention mechanism that addresses signal degradation in deep LLMs by allowing heads to attend to both current and previous layer KV pairs. The method achieves 97.3% of FlashAttention-2's efficiency while improving downstream performance by 2.11% with only a 3.7% computational overhead.
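The core move, as described, is letting a head attend over the union of the current and previous layer's KV pairs; a per-query numpy sketch of that idea (not ByteDance's fused kernel):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_layer_attention(q, K_cur, V_cur, K_prev, V_prev):
    # Concatenate the previous layer's KV pairs with the current layer's,
    # so the head can recover signal the residual stream has attenuated.
    K = np.concatenate([K_prev, K_cur])
    V = np.concatenate([V_prev, V_cur])
    w = softmax(K @ q / np.sqrt(q.shape[-1]))
    return w @ V

d, n = 64, 16
out = cross_layer_attention(np.random.randn(d),
                            np.random.randn(n, d), np.random.randn(n, d),
                            np.random.randn(n, d), np.random.randn(n, d))
```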
Cursor AI Code Editor Gains Viral Attention After Developer's Tweet About 'Biggest PR Win'
Cursor, an AI-powered code editor, received significant social media attention after a developer tweeted about its 'biggest PR win ever,' linking to a positive review. The tweet sparked discussion about the tool's growing popularity.
QV-Ka: New Research Proposes Eliminating Key Projection from Transformer Attention
A new arXiv paper argues the Key projection in Transformer attention is theoretically redundant. The proposed QV-Ka scheme removes it, simplifying the architecture while maintaining performance on language tasks.
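The redundancy argument: QK^T = (XW_q)(XW_k)^T = X(W_qW_k^T)X^T, so W_k can in principle be absorbed into the query projection. A numpy sketch with the keys taken to be the raw hidden states, which is one way to read the proposal:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_no_key_proj(X, W_q, W_v):
    Q = X @ W_q
    scores = softmax(Q @ X.T / np.sqrt(X.shape[-1]))  # keys are X itself
    return scores @ (X @ W_v)

d = 64
X = np.random.randn(10, d)
out = attention_no_key_proj(X, np.random.randn(d, d), np.random.randn(d, d))
```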
Moonshot AI's Kimi Introduces Attention Residuals to Mitigate Deep-Layer Information Loss in LLMs
Moonshot AI's Kimi team proposes Attention Residuals, a novel mechanism replacing standard residual connections. It allows each layer to attend to and selectively retrieve information from any previous layer, improving performance on long-context reasoning tasks.
Kimi Team's 'Attention Residuals' Replace Fixed Summation with Softmax Attention, Boost GPQA-Diamond by 7.5%
Researchers propose Attention Residuals, a content-dependent alternative to standard residual connections in Transformers. The method improves scaling laws, matches a baseline trained with 1.25x more compute, and adds under 2% inference overhead.
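Reading the two items together: the fixed residual sum h_l = h_{l-1} + f_l(h_{l-1}) is replaced by a softmax-attention mixture over all earlier layers' outputs. A per-token numpy sketch of that idea; the actual parameterization (learned per-layer queries and keys) is surely richer:

```python
import numpy as np

def attention_residual(layer_outputs, query_vec):
    # Content-dependent residual: softmax-weighted mixture over ALL previous
    # layers' outputs for this token, instead of a fixed summation.
    H = np.stack(layer_outputs)                    # (n_layers, d)
    logits = H @ query_vec / np.sqrt(H.shape[-1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ H

d = 64
outputs = [np.random.randn(d) for _ in range(12)]  # h_1 .. h_12, one token
res = attention_residual(outputs, query_vec=np.random.randn(d))
```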
New Research Shows How LLMs and Graph Attention Can Build Lightweight Strategic AI
A new arXiv paper proposes a hybrid AI framework for the Game of the Amazons that integrates LLMs with graph attention networks. It achieves strong performance in resource-constrained settings by using the LLM as a noisy supervisor and the graph network as a structural filter.
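The 'structural filter' ingredient is a standard graph attention layer (Velickovic et al., 2018); a numpy sketch of one GAT layer, with the LLM-as-supervisor wiring omitted:

```python
import numpy as np

def gat_layer(X, adj, W, a):
    # Attention logits e_ij = LeakyReLU(a . [Wx_i || Wx_j]), softmax-normalized
    # over each node's neighbors.
    H = X @ W                                      # (n, d')
    n, d = H.shape
    e = (a[:d] @ H.T)[:, None] + (a[d:] @ H.T)[None, :]   # e_ij per edge
    e = np.where(adj > 0, np.maximum(0.2 * e, e), -1e9)   # LeakyReLU + mask
    w = np.exp(e - e.max(axis=1, keepdims=True))
    w = w * (adj > 0)
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    return w @ H

n, d_in, d_out = 6, 8, 4
adj = (np.random.rand(n, n) < 0.5).astype(float)
np.fill_diagonal(adj, 1.0)                         # self-loops
out = gat_layer(np.random.randn(n, d_in), adj,
                np.random.randn(d_in, d_out), np.random.randn(2 * d_out))
```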
Decoding the First Token Fixation: How LLMs Develop Structural Attention Biases
New research reveals how large language models develop 'attention sinks'—disproportionate focus on the first input token—through a simple circuit mechanism that emerges early in training. This structural bias has significant implications for model interpretability and performance.
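The attention-sink effect is easy to observe directly; this snippet measures the share of attention mass each head puts on token 0 (GPT-2 here for convenience; the paper studies the phenomenon at scale):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
for layer, attn in enumerate(out.attentions):   # each: (batch, heads, q, k)
    sink = attn[0, :, 1:, 0].mean(dim=-1)       # mass on token 0, per head
    print(f"layer {layer:2d} first-token attention mass:", sink.numpy().round(2))
```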
New AI Research: Cluster-Aware Attention-Based Deep RL for Pickup and Delivery Problems
Researchers propose CAADRL, a deep reinforcement learning framework that explicitly models clustered spatial layouts to solve complex pickup and delivery routing problems more efficiently. It matches state-of-the-art performance with significantly lower inference latency.
The Great Unbundling: How AI Is Decoupling Human Attention from Digital Execution
The current AI revolution marks a fundamental architectural shift from deterministic software systems that require constant human oversight to probabilistic reasoning engines that execute tasks autonomously. This transition turns developers from code writers into boundary-condition designers, with profound implications for workflow automation and software development.
DeepSeek V4-Pro: 1.6T parameters, open weights, undercuts rivals 10x
DeepSeek unveiled V4-Pro and V4-Flash, its largest open-weight models with up to 1.6 trillion parameters and a 1M-token context window. The new hybrid attention architecture cuts compute for long contexts by 73–90%, enabling prices far below OpenAI, Google, and Anthropic.
Stanford Releases Free LLM & Transformer Cheatsheets Covering LoRA, RAG, MoE
Stanford University has released a free, open-source collection of cheatsheets covering core LLM concepts from self-attention to RAG and LoRA. This provides a consolidated technical reference for engineers and researchers.
UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems
A new arXiv paper introduces UniMixer, a unified scaling architecture for recommender systems. It bridges attention-based, TokenMixer-based, and factorization-machine-based methods into a single theoretical framework, aiming to improve parameter efficiency and scaling return on investment (ROI).
Anthropic Scrambles to Contain Major Source Code Leak for Claude Code
Anthropic is responding to a significant internal leak of approximately 500,000 lines of source code for its AI tool Claude Code, reportedly triggered by human error. The incident has drawn attention to security risks across the AI industry and coincides with reports that investor interest is shifting toward Anthropic amid valuation disparities with competitors.
HyenaRec: A Polynomial-Based Architecture for Fast, Scalable Sequential Recommendation
Researchers propose HyenaRec, a novel sequential recommender using Legendre polynomial kernels and gated convolutions. It achieves better accuracy than attention-based models while training up to 6x faster, especially on long user histories. This addresses a critical efficiency bottleneck in next-item prediction.
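The general pattern is attention-free sequence mixing: a causal long convolution modulated by a multiplicative gate. A numpy sketch with an explicit convolution kernel standing in for HyenaRec's Legendre-polynomial parameterization:

```python
import numpy as np

def gated_long_conv(x, kernel, gate_w):
    n, d = x.shape
    y = np.zeros_like(x)
    for t in range(n):                 # causal: y_t uses x_{<=t} only
        k = kernel[: t + 1][::-1]      # per-channel filter taps
        y[t] = (k * x[: t + 1]).sum(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w)))   # sigmoid gate from input
    return gate * y

n, d = 32, 16
x = np.random.randn(n, d)
out = gated_long_conv(x, kernel=np.random.randn(n, d) / np.sqrt(n),
                      gate_w=np.random.randn(d, d))
```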
VISTA: A Novel Two-Stage Framework for Scaling Sequential Recommenders to Lifelong User Histories
Researchers propose VISTA, a two-stage modeling framework that decomposes target attention to scale sequential recommendation to a million-item user history while keeping inference costs fixed. It has been deployed on a platform serving billions.
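'Decomposing target attention' most plausibly means a cheap first-stage scan that shortlists top-k history items per candidate, followed by exact attention on the shortlist, so cost stays fixed regardless of history length. A numpy sketch under that assumption (VISTA's stage-1 retriever is presumably learned):

```python
import numpy as np

def two_stage_target_attention(target, history, k=256):
    scores = history @ target                # stage 1: cheap O(n) scan
    idx = np.argpartition(scores, -k)[-k:]   # shortlist top-k history items
    sub = history[idx]
    logits = sub @ target                    # stage 2: exact attention on k items
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ sub                           # fixed-cost user interest vector

d = 64
rng = np.random.default_rng(0)
history = rng.standard_normal((1_000_000, d), dtype=np.float32)  # lifelong history
interest = two_stage_target_attention(
    rng.standard_normal(d, dtype=np.float32), history)
```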
GateSID: A New Framework for Adaptive Cold-Start Recommendation Using Semantic IDs
Researchers propose GateSID, an adaptive gating framework that dynamically balances semantic and collaborative signals for cold-start items. It uses hierarchical Semantic IDs and adaptive attention to improve recommendations, showing +2.6% GMV in online tests.
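A generic version of the gate is a sigmoid over both embeddings deciding the per-item mix; a numpy sketch of that pattern (GateSID's exact gate inputs and hierarchical Semantic-ID encoding are not reproduced):

```python
import numpy as np

def gated_item_embedding(e_sem, e_collab, W_g):
    g = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([e_sem, e_collab]))))
    # Cold items with weak collaborative signal should push g toward 0,
    # leaning on semantics; warm items push toward collaborative signal.
    return g * e_collab + (1.0 - g) * e_sem

d = 32
W_g = np.random.randn(2 * d)
e_item = gated_item_embedding(np.random.randn(d), np.random.randn(d), W_g)
```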
Google's TurboQuant Cuts LLM KV Cache Memory by 6x, Enables 3-Bit Storage Without Accuracy Loss
Google released TurboQuant, a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces memory by 6x, achieves 3-bit storage with no accuracy drop, and speeds up attention scoring by up to 8x on H100 GPUs.
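To make the storage math concrete, here is a generic per-channel 3-bit quantize/dequantize of a KV tensor; TurboQuant's actual two-stage algorithm and fused kernels are not shown:

```python
import numpy as np

def quantize_kv(kv, bits=3):
    levels = 2 ** bits - 1                   # 3 bits -> 8 levels
    lo = kv.min(axis=0, keepdims=True)       # per-channel min over tokens
    scale = (kv.max(axis=0, keepdims=True) - lo) / levels
    q = np.round((kv - lo) / (scale + 1e-12)).astype(np.uint8)  # 3-bit codes
    return q, lo, scale

def dequantize_kv(q, lo, scale):
    return q.astype(np.float32) * scale + lo

K = np.random.randn(4096, 128).astype(np.float32)   # one head's key cache
q, lo, scale = quantize_kv(K)
err = np.abs(dequantize_kv(q, lo, scale) - K).mean()
print(f"mean abs error at 3 bits: {err:.4f}")
# 16-bit -> 3-bit codes is ~5.3x smaller before packing/metadata overhead.
```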