What Happened
Moonshot AI, the company behind the Kimi long-context large language model, has introduced a new architectural component called Attention Residuals. The mechanism, described in a technical post by researcher Rohan Paul, addresses a fundamental limitation in standard transformer-based LLMs: the progressive degradation or loss of information from early layers as signals pass through deeper layers via simple additive residual connections.
The core problem is that in a standard transformer, each layer's output is added to a cumulative sum of all previous layers' outputs. This creates a "giant messy pile of added data" where subtle but critical information processed in the first few layers—such as key entities or facts in a long document—can become "completely buried under the weight of the newer layers." This leads to the model effectively "forgetting its initial thoughts" during extended reasoning chains.
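To make the "messy pile" concrete, here is a toy sketch (my illustration, not Moonshot's code) of a standard additive residual stream: each layer's update is folded into one running sum, so an early layer's contribution is never individually addressable downstream, only the blended total is.

```python
# Toy stand-in for a transformer block; returns some update vector.
def layer_update(layer_idx, x):
    return [v * 0.1 + layer_idx for v in x]

def residual_stream(num_layers, x0):
    """Standard additive residuals: x_{l+1} = x_l + f_l(x_l)."""
    x = list(x0)
    contributions = []          # keep per-layer updates for inspection
    for l in range(num_layers):
        update = layer_update(l, x)
        contributions.append(update)
        x = [a + b for a, b in zip(x, update)]  # blind addition
    return x, contributions

final, contribs = residual_stream(num_layers=8, x0=[1.0, 2.0])
# final is x0 plus the sum of all 8 updates; layer 0's update survives
# only as one term inside that sum, with no way to retrieve it alone.
```

The point of the sketch is the last comment: after the loop, nothing downstream can ask for layer 0's output specifically, which is exactly the access the attention-residual mechanism restores.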
The Technical Fix: From Additive to Attentive Connections
The proposed Attention Residual mechanism replaces the standard, blind addition with a form of cross-layer attention. Each layer is equipped with what the post describes as a "special spotlight tool." Instead of receiving the monolithic sum of all previous outputs, a layer can use this attention mechanism to look back at the individual output of every single past layer.
Here’s how it works:
- For a given layer L, the mechanism computes a relevance score for the output of each preceding layer (1 through L-1).
- These scores are based on the current context and what layer L "needs to figure out."
- The layer then performs a weighted combination of all past layer outputs, pulling forward the most relevant information. The post uses the analogy: "If layer fifty needs a specific noun that was processed way back in layer two, it simply shines its spotlight on layer two and pulls that exact data forward."
This creates a dynamic, content-aware pathway for information flow, allowing the model to maintain access to foundational context throughout deep processing.
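The steps above can be sketched in a few lines of plain Python. This is my reconstruction of the idea, not Moonshot's implementation: the current hidden state acts as the query, and each past layer's output serves as both key and value (real systems would use learned projections).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_residual(current_state, past_outputs):
    """Score every past layer's output against the current state and
    pull forward a softmax-weighted combination of them."""
    scores = [dot(current_state, h) for h in past_outputs]
    weights = softmax(scores)
    dim = len(current_state)
    pulled = [sum(w * h[i] for w, h in zip(weights, past_outputs))
              for i in range(dim)]
    return pulled, weights

# Layer 3 "shines its spotlight" over the outputs of layers 0-2:
past = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [1.0, 0.0]                       # most similar to layer 0
pulled, weights = attention_residual(state, past)
# weights sum to 1, and layer 0 (the best match to the query) gets the
# largest weight -- its information is retrieved, not buried in a sum.
```

Note the contrast with a plain residual connection: the weights depend on the content of the current state, so which early layer gets pulled forward changes from token to token.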
Scaling with Block Attention Residuals
A naive implementation of full cross-layer attention would be computationally prohibitive for a model with dozens or hundreds of layers: each layer must score every earlier layer's output, which amounts to O(L²) comparisons per forward pass.
To solve this, the team developed Block Attention Residuals. This method groups consecutive layers into chunks or blocks. Attention is applied at the block level: a layer can attend to the outputs of previous blocks, rather than every individual past layer. This significantly reduces the memory and computational overhead while preserving the core benefit of selective long-range access. The post notes this "block method speeds up processing while still letting the model easily reach back for missing context."
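A hedged sketch of the block variant follows; the pooling choice (mean) and the exact counting are my assumptions, since the source post does not specify them. It shows the two ingredients: consecutive layer outputs condensed into per-block summaries, and the reduction in pairwise look-back cost.

```python
def block_summaries(past_outputs, block_size):
    """Mean-pool consecutive layer outputs into per-block summaries
    (an assumed pooling choice; the post does not specify one)."""
    summaries = []
    for start in range(0, len(past_outputs), block_size):
        chunk = past_outputs[start:start + block_size]
        dim = len(chunk[0])
        summaries.append([sum(h[i] for h in chunk) / len(chunk)
                          for i in range(dim)])
    return summaries

def comparisons(num_layers, block_size=None):
    """Count relevance-score computations across a forward pass."""
    if block_size is None:
        # full cross-layer attention: layer l scores all l predecessors
        return sum(l for l in range(num_layers))
    # block variant: layer l scores only the completed blocks before it
    return sum(l // block_size for l in range(num_layers))

full = comparisons(48)                 # 48 layers, per-layer look-back
blocked = comparisons(48, block_size=4)
# Blocking with size-4 blocks cuts the look-back count by roughly the
# block size (1128 vs 264 score computations for 48 layers).
```

For a 48-layer stack like Kimi Linear's, the count drops from 1128 pairwise scores to 264 with blocks of four, while a layer can still reach any earlier region of the network through its block summary.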
The mechanism was tested in the Kimi Linear 48B model architecture (48 billion parameters), where it was reported to make "everything run smoother" and enable the model to "handle incredibly complex reasoning tasks much better because it never loses track of the foundational clues it picked up at the start."
Context and Implications
This development is part of the ongoing research frontier focused on improving transformer efficiency and capability, especially for long-context tasks. While the source post does not include specific benchmark numbers, the described problem—vanishing early-layer signals—is a known challenge in training very deep networks. Solutions like this aim to move beyond the fixed, local connectivity of standard residual networks towards more flexible, non-local interaction, similar in spirit to techniques like Transformer-XL's segment-level recurrence or Compressive Transformers, but applied within a single forward pass of a model's layers rather than across sequence segments.
The work from Moonshot AI appears to be an engineering-focused innovation to enhance the reasoning coherence of their flagship Kimi model, which is marketed for its long-context capabilities (reportedly up to 1 million tokens).