Kimi Team's 'Attention Residuals' Replace Fixed Summation with Softmax Attention, Boosting GPQA-Diamond by 7.5%
AI ResearchScore: 95

Researchers propose Attention Residuals, a content-dependent alternative to standard residual connections in Transformers. The method improves scaling laws, matches a baseline trained with 1.25x more compute, and adds under 2% inference overhead.

via @omarsar0

Attention Residuals: A Content-Dependent Alternative to Standard Transformer Residual Connections

A technical report from the Kimi team proposes a fundamental architectural modification to the Transformer's residual connection mechanism. The work, titled "Attention Residuals," argues that the standard practice of blindly summing layer outputs with a fixed residual path forces uncontrolled hidden-state growth with depth and limits information flow.

What the Researchers Built

The core innovation is replacing the fixed, additive residual connection with a softmax attention operation over previous layer outputs. Instead of each layer computing output = layer(x) + x (where x is its input from the previous layer), the new mechanism lets each layer selectively retrieve the specific earlier representations it needs.

Formally, for a current layer l, the input is computed as a weighted combination of all previous layer outputs h_0, h_1, ..., h_{l-1}, where the weights are determined by a content-based attention score between the current layer's query and the keys of previous layers.
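The formulation above can be sketched in a few lines of numpy. This is a minimal, hypothetical illustration: the projection matrices Wq and Wk, the choice of query (the most recent hidden state), and the scaling are assumptions on my part, since the source does not specify the report's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual_input(hs, Wq, Wk):
    """Combine previous layer outputs h_0 .. h_{l-1} into the input
    for layer l via softmax attention (illustrative sketch, not the
    paper's exact formulation).

    hs: list of (d,) hidden states from all earlier layers.
    Wq, Wk: (d, d) projection matrices (assumed; details unspecified).
    """
    H = np.stack(hs)                              # (l, d)
    q = hs[-1] @ Wq                               # query from latest state
    K = H @ Wk                                    # one key per earlier layer
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))   # (l,) content-based weights
    return w @ H                                  # weighted mix of residuals
```

Because the weights come from a softmax over content-based scores, the layer input is a convex combination of earlier states rather than their fixed sum, which is what keeps hidden-state magnitude from growing uncontrollably with depth.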

Key Results

The paper reports empirical gains across several benchmarks, comparing models using Attention Residuals against standard Transformer baselines of equivalent compute budget.

- GPQA-Diamond: +7.5% (exact metric not specified, presumed accuracy)
- HumanEval (code generation): +3.1% (pass@1)
- Scaling-law efficiency: matches a baseline trained with 1.25x more compute
- Inference overhead: < 2% added latency

The results indicate that the content-dependent mixing of residuals improves model capability, particularly on reasoning-heavy tasks like GPQA, without significant computational cost.

How It Works: Blockwise Attention for Practical Scaling

The naive implementation of Attention Residuals—where each layer attends to all previous layers—would create a quadratic memory overhead with depth, making it impractical for large-scale models.

To solve this, the authors introduce a blockwise version. Layers are grouped into blocks (e.g., every 8 layers). Instead of attending to all individual layer outputs, the mechanism attends to a compressed summary representation for each block. This blockwise compression recovers most of the performance gains while keeping systems overhead minimal, leading to the reported sub-2% inference latency increase.
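The blockwise idea can be sketched as follows. The compression scheme here (a simple mean over each full block) and the block size are placeholders, since the source does not describe how the summary representations are actually computed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_residual_input(hs, block_size, Wq, Wk):
    """Blockwise variant: attend over per-block summaries instead of
    every individual earlier layer output (illustrative sketch).

    hs: list of (d,) hidden states from all earlier layers.
    block_size: layers per block (e.g., 8 in the paper's example).
    """
    H = np.stack(hs)                              # (l, d)
    n_full = len(hs) // block_size
    # Compress each complete block into one summary vector (mean here;
    # the paper's actual compression is unspecified in the source).
    blocks = [H[i * block_size:(i + 1) * block_size].mean(axis=0)
              for i in range(n_full)]
    # Layers past the last complete block remain uncompressed.
    tail = list(H[n_full * block_size:])
    S = np.stack(blocks + tail)                   # (l//B + l%B, d)
    q = hs[-1] @ Wq
    K = S @ Wk
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return w @ S
```

With blocks of size B, a layer at depth l attends over roughly l/B summaries instead of l individual states, which is what reduces the memory and compute overhead from quadratic in depth to near-linear.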

The training process and other hyperparameters (learning rate schedules, optimizer details) are presumed to be consistent with standard LLM pretraining, though the source tweet does not specify these details.

Why It Matters

Residual connections are a foundational, nearly unchanged component of modern LLM architectures. They were introduced to solve the vanishing gradient problem in very deep networks, enabling the training of models with hundreds of layers. However, their fixed, additive nature is a simplifying assumption. This work challenges that assumption, demonstrating that making the residual pathway content-dependent and selective is a more efficient way to propagate information through the network's depth.

The performance gains—especially matching a baseline that required 25% more compute—suggest Attention Residuals could lead to more compute-efficient scaling. The minimal inference overhead makes it a viable candidate for integration into production-scale models seeking better reasoning performance without a major latency trade-off.

AI Analysis

This is a targeted, architectural intervention at a component most consider 'solved.' The standard residual connection is so ubiquitous it's rarely questioned. The Kimi team's approach is conceptually elegant: treat the problem of combining current and past representations as a retrieval problem, which is precisely what attention mechanisms are designed for. The blockwise compression is the critical engineering insight that makes it practical; it's a classic trade-off between granularity and efficiency.

Practitioners should note where the gains are largest: GPQA-Diamond, a notoriously difficult, graduate-level reasoning benchmark. A +7.5% lift there is substantial and suggests the method improves the model's ability to integrate and reason over complex information across many layers. The more modest gain on HumanEval (+3.1%) is still meaningful for code generation.

The claim of matching a 1.25x compute baseline is significant if validated in larger-scale training runs. It implies a direct improvement to scaling laws, meaning you could either get a better model for the same cost or the same model for less cost. The next step is to see this technique implemented and tested in an open-source model family (like Llama or Mistral) to verify the gains generalize outside the authors' own training infrastructure.
Original source: x.com
