Attention Residuals: A Content-Dependent Alternative to Standard Transformer Residual Connections
A technical report from the Kimi team proposes a fundamental architectural modification to the Transformer's residual connection mechanism. The work, titled "Attention Residuals," argues that the standard practice of blindly summing layer outputs with a fixed residual path forces uncontrolled hidden-state growth with depth and limits information flow.
What the Researchers Built
The core innovation is replacing the fixed, additive residual connection with a softmax attention operation over previous layer outputs. Instead of each layer computing output = layer(x) + x (where x, the layer's input, is the output of the previous layer), the new mechanism allows the layer to selectively retrieve the specific earlier representations it needs.
Formally, for a current layer l, the input is computed as a weighted combination of all previous layer outputs h_0, h_1, ..., h_{l-1}, where the weights are determined by a content-based attention score between the current layer's query and the keys of previous layers.
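The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrices (q_proj, k_proj), the scaled dot-product scoring, and the use of the current hidden state to form the query are all assumptions, since the report's exact parameterization is not given in the source.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_residual_input(h_prev, x, q_proj, k_proj):
    """Compute the input to layer l as a softmax-weighted mix of
    previous layer outputs h_0 .. h_{l-1}.

    h_prev : (l, d) array, one row per previous layer output
    x      : (d,) current hidden state, used to form the query (assumption)
    q_proj, k_proj : (d, d_k) projections (hypothetical names)
    """
    q = x @ q_proj                         # query from the current layer
    k = h_prev @ k_proj                    # one key per previous layer
    scores = k @ q / np.sqrt(q.shape[-1])  # content-based layer scores
    weights = softmax(scores)              # (l,) distribution over layers
    return weights @ h_prev                # selective residual mix
```

Because the weights form a convex combination, each coordinate of the mixed residual stays within the range spanned by the previous layers, unlike plain summation, whose norm grows with depth.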
Key Results
The paper reports empirical gains across several benchmarks, comparing models using Attention Residuals against standard Transformer baselines of equivalent compute budget.
- GPQA-Diamond: +7.5% (exact metric not specified, presumed accuracy)
- HumanEval (code generation): +3.1% (pass@1)
- Scaling-law efficiency: matches baseline performance trained with 1.25x more compute
- Inference overhead: under 2% added latency

The results indicate that the content-dependent mixing of residuals improves model capability, particularly on reasoning-heavy tasks like GPQA, without significant computational cost.
How It Works: Blockwise Attention for Practical Scaling
The naive implementation of Attention Residuals—where each layer attends to all previous layers—would create a quadratic memory overhead with depth, making it impractical for large-scale models.
To solve this, the authors introduce a blockwise version. Layers are grouped into blocks (e.g., every 8 layers). Instead of attending to all individual layer outputs, the mechanism attends to a compressed summary representation for each block. This blockwise compression recovers most of the performance gains while keeping systems overhead minimal, leading to the reported sub-2% inference latency increase.
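The blockwise idea can be illustrated with a small sketch. Mean pooling is used here as a stand-in for the compression step, since the report's actual summary mechanism is not specified in the source; the function name and block size default are hypothetical.

```python
import numpy as np

def block_summaries(layer_outputs, block_size=8):
    """Compress each block of layer outputs into one summary vector.

    layer_outputs : (n_layers, d) array of per-layer outputs
    returns       : (ceil(n_layers / block_size), d) array of summaries

    Mean pooling is an illustrative choice, not the paper's method.
    """
    n, _ = layer_outputs.shape
    blocks = [layer_outputs[i:i + block_size]
              for i in range(0, n, block_size)]
    return np.stack([b.mean(axis=0) for b in blocks])
```

A layer then attends over roughly n/8 summaries instead of all n previous outputs, which is what keeps the memory and latency overhead small as depth grows.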
The training process and other hyperparameters (learning rate schedules, optimizer details) are presumed to be consistent with standard LLM pretraining, though the source tweet does not specify these details.
Why It Matters
Residual connections are a foundational, nearly unchanged component of modern LLM architectures. They were introduced to solve the vanishing gradient problem in very deep networks, enabling the training of models with hundreds of layers. However, their fixed, additive nature is a simplifying assumption. This work challenges that assumption, demonstrating that making the residual pathway content-dependent and selective is a more efficient way to propagate information through the network's depth.
The performance gains—especially matching a baseline that required 25% more compute—suggest Attention Residuals could lead to more compute-efficient scaling. The minimal inference overhead makes it a viable candidate for integration into production-scale models seeking better reasoning performance without a major latency trade-off.



