A research team from Google has introduced a surprisingly simple yet effective technique called Memory Caching that addresses a fundamental limitation of Recurrent Neural Networks (RNNs): their inability to retain long-term information in very long sequences. Published in the paper "Memory Caching: RNNs with Growing Memory" (Behrouz et al., 2026), the method enables RNNs to maintain access to historical context without resorting to the quadratic computational cost of Transformer attention.
The Core Problem: RNN Memory Compression
Modern RNN variants like LSTMs and GRUs compress the entire input sequence into a single fixed-size memory state. As new tokens are processed, this state is continuously updated, inevitably overwriting older information. This "memory compression bottleneck" has kept RNNs from matching Transformer performance on tasks requiring long-range recall, despite their superior O(L) sequential processing efficiency compared to Transformer's O(L²) attention cost.
What Memory Caching Does
The solution is conceptually straightforward: instead of maintaining only one memory state, the sequence is split into segments, and the RNN's memory state is saved as a checkpoint at the end of each segment. When generating output, each token can look back at all saved checkpoints, not just the current compressed memory.
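The checkpointing idea can be sketched in a few lines. This is a minimal illustration with a toy tanh RNN, not the paper's implementation; the names `rnn_step`, `run_with_checkpoints`, and the weight matrices are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h, x, W_h, W_x):
    """One step of a toy tanh RNN: h' = tanh(W_h h + W_x x)."""
    return np.tanh(W_h @ h + W_x @ x)

def run_with_checkpoints(xs, segment_len, d_hidden, W_h, W_x):
    """Process a sequence, snapshotting the hidden state at each segment boundary."""
    h = np.zeros(d_hidden)
    cache = []  # one checkpointed state per completed segment
    for t, x in enumerate(xs, start=1):
        h = rnn_step(h, x, W_h, W_x)
        if t % segment_len == 0:
            cache.append(h.copy())  # snapshot, not a live reference
    return h, cache

d_in, d_hidden, L, seg = 4, 8, 32, 8
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
xs = rng.normal(size=(L, d_in))

h_final, cache = run_with_checkpoints(xs, seg, d_hidden, W_h, W_x)
print(len(cache))  # L / seg = 4 cached segment states
```

At generation time, each output token would consult `cache` in addition to the current compressed state; the mechanisms for doing so are the four strategies described below.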
This creates a tunable complexity trade-off:
- Standard RNN: O(L) complexity, only the current state
- Transformer: O(L²) complexity, full context window
- Memory Caching: O(NL) complexity, N cached segment states

Here L is sequence length and N is the number of segments. By adjusting N, the model smoothly interpolates between RNN-like efficiency and Transformer-like recall capability.
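A quick back-of-the-envelope calculation makes the trade-off concrete (operation counts up to constant factors; the specific L is illustrative):

```python
# Relative operation counts for a 65,536-token sequence.
L = 65_536

standard_rnn = L        # O(L): one sequential update per token
transformer = L * L     # O(L^2): pairwise attention over all tokens

for N in (4, 64, 1024):         # number of cached segments
    memory_caching = N * L      # O(NL): each token reads N cached states
    print(N, memory_caching / transformer)
```

Even with 1,024 cached segments, the per-token cost stays orders of magnitude below full attention at this sequence length.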
Four Implementation Strategies
The paper proposes and evaluates four specific mechanisms for utilizing cached memories:
- Residual Memory: The simplest approach—just sum all cached states for each token.
- Gated Residual Memory (GRM): Introduces input-dependent gates that weigh each segment's relevance to the current token. This method consistently performed best across tasks.
- Memory Soup: Interpolates the actual parameters of cached memories into a custom per-token network.
- Sparse Selective Caching (SSC): Uses MoE-style routing to select only the most relevant segments, reducing computational overhead.
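Of the four, SSC is the least obvious, so here is a rough sketch of the routing idea: score each cached state against the current token's state and keep only the top-k. The dot-product router and the `W_r` matrix are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_select(h_current, cache, W_r, k=2):
    """MoE-style routing sketch: score each cached state, keep the top-k,
    and combine them with a softmax over the selected scores."""
    scores = np.array([h_current @ (W_r @ h_c) for h_c in cache])
    top = np.argsort(scores)[-k:]             # indices of the k best segments
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                  # softmax over selected segments
    return sum(w * cache[i] for w, i in zip(weights, top))

d = 8
cache = [rng.normal(size=d) for _ in range(6)]   # 6 cached segment states
h_t = rng.normal(size=d)
W_r = rng.normal(scale=0.3, size=(d, d))
mem = sparse_select(h_t, cache, W_r, k=2)
print(mem.shape)  # (8,)
```

The payoff is that the per-token cost scales with k rather than with the full number of cached segments N.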
Key Results and Performance
Experiments at academic scale (up to 1.3B parameters) show Memory Caching significantly closes the performance gap between RNNs and Transformers on recall-heavy tasks. When applied to already strong models like Google's Titans architecture, it pushes them further ahead on language understanding benchmarks.
Notably, the paper provides a clean theoretical insight: under simplifying assumptions, hybrid architectures that interleave RNN and attention layers can be viewed as a special case of Memory Caching. This explains why such hybrids have shown promise—they're implicitly caching memory states.
Transformers still maintain an advantage on the most challenging retrieval tasks (like UUID lookup in extremely long contexts), but Memory Caching establishes a viable middle ground between fixed-memory efficiency and full-context recall.
Technical Implementation Details
The GRM approach works by maintaining a cache of historical hidden states H = {h₁, h₂, ..., hₙ} from segment boundaries. For each current token at position t, the model computes:
gate_i = σ(W_g · [h_current, h_cached_i] + b_g)
weighted_memories = Σᵢ (gate_i · h_cached_i)
h_enhanced = h_current + weighted_memories
where σ is the sigmoid function and each gate learns how strongly to attend to its historical segment. This adds minimal computational overhead while providing access to the entire sequence history.
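The GRM equations above translate directly into code. This is a minimal sketch with randomly initialized parameters; in practice `W_g` and `b_g` would be learned, and the cache would come from the segment checkpointing step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grm_enhance(h_current, cache, W_g, b_g):
    """Gated Residual Memory sketch: one sigmoid gate per cached state,
    computed from the concatenation [h_current, h_cached_i], then a
    gated sum added residually to the current state."""
    weighted = np.zeros_like(h_current)
    for h_cached in cache:
        gate = sigmoid(W_g @ np.concatenate([h_current, h_cached]) + b_g)
        weighted += gate * h_cached        # elementwise gate per segment
    return h_current + weighted            # residual connection

rng = np.random.default_rng(2)
d = 8
cache = [rng.normal(size=d) for _ in range(4)]   # 4 cached segment states
h_t = rng.normal(size=d)
W_g = rng.normal(scale=0.2, size=(d, 2 * d))     # gate over concatenated states
b_g = np.zeros(d)

h_enhanced = grm_enhance(h_t, cache, W_g, b_g)
print(h_enhanced.shape)  # (8,)
```

Note that with an empty cache the function reduces to the identity on `h_current`, which is consistent with the residual formulation.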
What This Means in Practice
For practitioners working with long sequences, Memory Caching offers a practical alternative when full Transformer attention is computationally prohibitive. The method is particularly relevant for:
- Streaming applications where sequences are unbounded
- Edge deployment where memory and compute are constrained
- Hybrid systems that mix RNN and attention components
The team notes that all experiments are at academic scale, and whether these gains hold at frontier scale (100B+ parameters) remains an open question.
gentic.news Analysis
This work comes from the same Google Research team behind Titans and MIRAS, positioning it as part of a coherent research program on memory-augmented sequence models. The trend is clear: after years of Transformer dominance, there's renewed interest in making RNN-like architectures competitive for long-context tasks.
Memory Caching represents a pragmatic engineering solution rather than a theoretical breakthrough—it's essentially adding checkpointing to RNN training and inference. What's significant is how effectively this simple idea works, and how it provides a unified framework for understanding hybrid architectures.
This development aligns with broader industry efforts to reduce the quadratic cost of attention. Just last month, we covered Mamba-2's state-space model improvements and xLSTM's exponential gating mechanisms—both aiming for linear-time long-context modeling. Memory Caching takes a different, complementary approach: instead of redesigning the core architecture, it adds a lightweight caching mechanism on top of existing RNNs.
For AI engineers, the most immediate implication is that RNNs shouldn't be dismissed for long-sequence tasks. With techniques like Memory Caching, they can achieve near-Transformer recall with substantially better computational characteristics. The method's simplicity means it could be quickly adopted in production systems dealing with long documents, video processing, or real-time sensor data.
Frequently Asked Questions
How does Memory Caching compare to Transformer attention?
Memory Caching provides O(NL) complexity versus Transformer's O(L²), making it more efficient for very long sequences. However, it doesn't provide full token-to-token attention—instead, it gives each token access to summarized segment states. For most recall tasks, this proves sufficient, though Transformers still excel at precise token-level retrieval.
Can Memory Caching be combined with existing RNN architectures?
Yes, the paper demonstrates the technique working with standard LSTM and GRU architectures, as well as more recent variants. It's essentially a wrapper that can be added to any RNN-based model without modifying the core recurrence mechanism.
What are the practical limitations of this approach?
The main limitation is the need to choose segment boundaries. The paper uses fixed-length segments, but adaptive segmentation based on content could yield further improvements. Additionally, all experiments are at ≤1.3B scale—performance at larger scales remains unverified.
How does this relate to other long-context techniques like sliding window attention?
Memory Caching is complementary to attention-based methods. While sliding window attention restricts attention to nearby tokens, Memory Caching provides access to summarized information from the entire history. The two could potentially be combined for even better long-context performance.
Source: "Memory Caching: RNNs with Growing Memory" by Behrouz et al., Google Research, 2026. Experiments conducted at up to 1.3B parameters. Implementation details available in the forthcoming paper.