DeepSeek-V4 reportedly runs a 500K-token context with 90% less KV cache using FlashMemory's Lookahead Sparse Attention. The approach keeps only 13.5% of the cache in GPU memory without retraining the backbone.
Key facts
- 500,000-token context window achieved.
- KV cache reduced by 90%.
- Only 13.5% of cache stays in GPU memory.
- Zero backbone retraining required.
- Neural Memory Indexer predicts future token needs.
DeepSeek-V4 now supports 500,000-token context windows while reducing KV cache memory by 90%, according to @HuggingPapers. The improvement comes from FlashMemory, a new attention mechanism that introduces a tiny Neural Memory Indexer to predict which cache chunks future tokens will need.
The indexer keeps only 13.5% of the KV cache in GPU memory, yet the paper reportedly claims better accuracy than full-cache baselines. No backbone retraining is required, which suggests the technique can be added to existing attention layers without changing the model weights.
How Lookahead Sparse Attention Works
Standard sparse attention methods prune the cache using fixed patterns or heuristics. FlashMemory instead trains a lightweight indexer (presumably a small MLP or transformer; the tweet does not say) to predict which KV cache chunks the model will attend to in upcoming generation steps. This lookahead lets the system prefetch only the relevant chunks into GPU memory, shrinking the resident footprint.
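The selection step described above can be illustrated with a minimal sketch. It is not the FlashMemory implementation, which has not been published in the source. It assumes, hypothetically, that each chunk is summarized by the mean of its key vectors, that the indexer's output can be stood in for by a single "predicted query" vector, and that chunks are ranked by dot product. The names `chunk_summaries`, `select_chunks`, and `keep_fraction` are illustrative only.

```python
import math
import random


def chunk_summaries(keys, chunk_size):
    """Mean-pool key vectors per chunk (a hypothetical chunk summary)."""
    summaries = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        summaries.append(
            [sum(k[d] for k in chunk) / len(chunk) for d in range(dim)]
        )
    return summaries


def select_chunks(predicted_query, summaries, keep_fraction):
    """Return the indices of the chunks to keep in GPU memory.

    predicted_query stands in for the output of a learned indexer
    (an assumption; the real indexer architecture is undisclosed).
    """
    scores = [
        sum(q * s for q, s in zip(predicted_query, summary))
        for summary in summaries
    ]
    # Budget: at least one chunk, otherwise a fixed fraction of all chunks.
    budget = max(1, math.ceil(keep_fraction * len(summaries)))
    ranked = sorted(range(len(summaries)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: 1,000 cached tokens with 8-dimensional keys, 10 tokens per chunk.
random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
summaries = chunk_summaries(keys, chunk_size=10)
predicted_query = [random.gauss(0, 1) for _ in range(8)]

resident = select_chunks(predicted_query, summaries, keep_fraction=0.135)
print(f"{len(resident)} of {len(summaries)} chunks stay on the GPU")
```

With 100 chunks and a 13.5% budget, 14 chunks remain resident and the rest could be offloaded to host memory. A real system would also need to handle mispredictions, for example by fetching a missing chunk on demand.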
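The claimed 90% reduction is significant: for a 500K-token context, a conventional KV cache can consume on the order of a hundred gigabytes or more of HBM, depending on layer count, number of KV heads, and numeric precision. FlashMemory would cut the GPU-resident share to a fraction of that, enabling long-context inference on fewer GPUs. The tweet does not reconcile its two headline figures: keeping 13.5% of the cache on the GPU corresponds to an 86.5% cut in resident cache, so the 90% figure is presumably rounded or measured differently.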
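The memory arithmetic behind this paragraph can be sketched as follows. The model dimensions below (80 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and do not describe DeepSeek-V4, whose cache layout is not given in the source. The formula is the standard one for a plain key/value cache: two tensors per layer, one vector per KV head per token.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a plain KV cache for one sequence, in bytes.

    The factor of 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model configuration (an assumption, not DeepSeek-V4's).
full = kv_cache_bytes(seq_len=500_000, num_layers=80, num_kv_heads=8, head_dim=128)

# Share that would stay in GPU memory under the reported 13.5% figure.
resident = full * 0.135

print(f"full cache:        {full / 1e9:.1f} GB")
print(f"resident at 13.5%: {resident / 1e9:.1f} GB")
```

Under these assumptions the full cache is about 163.8 GB for a single sequence and the resident share is about 22.1 GB. Because the formula is linear in `seq_len`, doubling the context doubles both numbers.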
Benchmarks and Claims
The source tweet claims "better accuracy" but does not disclose specific benchmark scores or the size of the Neural Memory Indexer. No comparisons against other sparse attention methods (e.g., StreamingLLM, H2O, or Quest) are provided in the tweet. The paper link was not shared in the source.
This is a pattern typical of HuggingPapers announcements: a high-level claim with limited numerical detail. Independent verification via a published arXiv preprint or code release would strengthen confidence.
Why This Matters
Long-context inference has been bottlenecked by KV cache memory, which grows linearly with sequence length. A 500K-token context on DeepSeek-V4 was previously impractical on consumer hardware. If the zero-retraining claim holds, existing checkpoints could adopt FlashMemory without another training run, although the tweet does not confirm that it applies to models fine-tuned on DeepSeek-V3 or earlier versions.
If the accuracy claim holds, this would be a Pareto improvement over both full-cache attention and prior sparse attention methods, with no trade-off between memory and quality. However, the absence of ablation studies or comparison tables makes it impossible to evaluate the claim rigorously.
What to Watch
Watch for the release of the FlashMemory paper and code. Key metrics to verify: accuracy on long-context benchmarks like RULER or LongBench, the indexer's parameter count, and whether the 13.5% figure holds across different sequence lengths and batch sizes. If DeepSeek publishes an arXiv preprint with full ablations, confidence in this approach would increase significantly.





