Attention Residuals: A Content-Dependent Alternative to Standard Transformer Residual Connections
A technical report from the Kimi team proposes a fundamental architectural modification to the Transformer's residual connection mechanism. The work, titled "Attention Residuals," argues that the standard practice of blindly summing layer outputs with a fixed residual path forces uncontrolled hidden-state growth with depth and limits information flow.
What the Researchers Built
The core innovation is replacing the fixed, additive residual connection with a softmax attention operation over previous layer outputs. Instead of each layer computing output = layer(x) + x (where x, the layer's input, is the output of the previous layer), the new mechanism allows the layer to selectively retrieve the specific earlier representations it needs.
Formally, for a current layer l, the input is computed as a weighted combination of all previous layer outputs h_0, h_1, ..., h_{l-1}, where the weights are determined by a content-based attention score between the current layer's query and the keys of previous layers.
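The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrices (q_proj, k_proj), the scaled dot-product scoring, and the use of the current hidden state to form the query are all assumptions, since the report's exact parameterization is not given in the source.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_residual_input(h_prev, x, q_proj, k_proj):
    """Compute the input to layer l as a softmax-weighted mix of
    previous layer outputs h_0 .. h_{l-1}.

    h_prev : (l, d) array, one row per previous layer output
    x      : (d,) current hidden state, used to form the query (assumption)
    q_proj, k_proj : (d, d_k) projections (hypothetical names)
    """
    q = x @ q_proj                         # query from the current layer
    k = h_prev @ k_proj                    # one key per previous layer
    scores = k @ q / np.sqrt(q.shape[-1])  # content-based layer scores
    weights = softmax(scores)              # (l,) distribution over layers
    return weights @ h_prev                # selective residual mix
```

Because the weights form a convex combination, each coordinate of the mixed residual stays within the range spanned by the previous layers, unlike plain summation, whose norm grows with depth.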
Key Results
The paper reports empirical gains across several benchmarks, comparing models using Attention Residuals against standard Transformer baselines of equivalent compute budget.
- GPQA-Diamond: +7.5% (exact metric not specified, presumed accuracy)
- HumanEval (code generation): +3.1% (pass@1)
- Scaling-law efficiency: matches baseline performance trained with 1.25x more compute
- Inference overhead: under 2% added latency

The results indicate that the content-dependent mixing of residuals improves model capability, particularly on reasoning-heavy tasks like GPQA, without significant computational cost.
How It Works: Blockwise Attention for Practical Scaling
The naive implementation of Attention Residuals—where each layer attends to all previous layers—would create a quadratic memory overhead with depth, making it impractical for large-scale models.
To solve this, the authors introduce a blockwise version. Layers are grouped into blocks (e.g., every 8 layers). Instead of attending to all individual layer outputs, the mechanism attends to a compressed summary representation for each block. This blockwise compression recovers most of the performance gains while keeping systems overhead minimal, leading to the reported sub-2% inference latency increase.
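The blockwise idea can be illustrated with a small sketch. Mean pooling is used here as a stand-in for the compression step, since the report's actual summary mechanism is not specified in the source; the function name and block size default are hypothetical.

```python
import numpy as np

def block_summaries(layer_outputs, block_size=8):
    """Compress each block of layer outputs into one summary vector.

    layer_outputs : (n_layers, d) array of per-layer outputs
    returns       : (ceil(n_layers / block_size), d) array of summaries

    Mean pooling is an illustrative choice, not the paper's method.
    """
    n, _ = layer_outputs.shape
    blocks = [layer_outputs[i:i + block_size]
              for i in range(0, n, block_size)]
    return np.stack([b.mean(axis=0) for b in blocks])
```

A layer then attends over roughly n/8 summaries instead of all n previous outputs, which is what keeps the memory and latency overhead small as depth grows.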
The training process and other hyperparameters (learning rate schedules, optimizer details) are presumed to be consistent with standard LLM pretraining, though the source tweet does not specify these details.
Why It Matters
Residual connections are a foundational, nearly unchanged component of modern LLM architectures. They were introduced to solve the vanishing gradient problem in very deep networks, enabling the training of models with hundreds of layers. However, their fixed, additive nature is a simplifying assumption. This work challenges that assumption, demonstrating that making the residual pathway content-dependent and selective is a more efficient way to propagate information through the network's depth.
The performance gains—especially matching a baseline that required 25% more compute—suggest Attention Residuals could lead to more compute-efficient scaling. The minimal inference overhead makes it a viable candidate for integration into production-scale models seeking better reasoning performance without a major latency trade-off.



