DeepSeek-V4 Brings Million-Token Context to Open Models — at a Fraction of the Cost
DeepSeek has released V4, the latest iteration of their open-weight language model family, and the headline feature is a dramatic reduction in the cost of long-context inference. According to technical details shared by the team, V4 runs 1M-token contexts using only 10% of the KV cache and 27% of the inference FLOPs compared to V3.2.
This isn't a marginal improvement — it's the kind of efficiency gain that changes what's economically viable. Previously, serving a 1M-token context required expensive hardware or severely limited throughput. V4 makes long-context inference something you can run by default on smaller machines.
What the Numbers Mean
At a glance:

- KV cache: 10% of V3.2. Serve longer contexts on smaller GPUs, or pack more concurrent users.
- Inference FLOPs: 27% of V3.2. Each generated token at 1M context is ~4x cheaper to compute.

KV cache is the memory footprint your GPU holds for every token already in context. It grows linearly with context length, and at 1M tokens it's usually what forces you onto bigger hardware or kills your throughput. Cutting it to 10% means you can serve longer contexts on smaller machines, or pack far more concurrent users on the same ones.
Inference FLOPs is the compute cost of generating the next token. With vanilla attention, each new token attends over every token already in context, so per-token cost grows linearly with context length (and processing the full prompt grows quadratically), which is why long contexts get brutally expensive. 27% means each generated token at 1M context is nearly 4x cheaper to produce.
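To make those ratios concrete, here is a quick back-of-envelope calculation using only the figures reported by the team (10% KV cache, 27% FLOPs). The absolute V3.2 cache size below is a placeholder for illustration, not a published number:

```python
# Cost comparison at 1M-token context, using the reported V4 vs. V3.2 ratios.
# The V3.2 cache size is a hypothetical placeholder, not a published figure.

v32_kv_cache_gb = 240.0        # hypothetical V3.2 KV cache at 1M tokens
kv_ratio = 0.10                # V4 uses 10% of V3.2's KV cache
flops_ratio = 0.27             # V4 uses 27% of V3.2's per-token FLOPs

v4_kv_cache_gb = v32_kv_cache_gb * kv_ratio
per_token_speedup = 1 / flops_ratio

print(f"V4 KV cache: {v4_kv_cache_gb:.0f} GB (vs {v32_kv_cache_gb:.0f} GB)")
print(f"Per-token compute advantage: {per_token_speedup:.1f}x")
```

The 1/0.27 ratio is where the "nearly 4x cheaper" figure comes from.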
How It Works: Hybrid Attention Design
The core innovation is a hybrid attention design that interleaves two mechanisms instead of picking one.
CSA (Compressed Sparse Attention) first squashes every 4 KV entries into a single compressed entry. Then it uses a lightning indexer to select the top-k most relevant compressed blocks. Compression and sparsity stacked.
HCA (Heavily Compressed Attention) is more aggressive. It compresses every 128 tokens into a single entry and skips sparse selection entirely, because at that compression ratio dense attention over the tiny remaining set is already cheap.
Both get a sliding window branch over the last 128 tokens, so local fine-grained structure isn't lost to compression.
The result is that CSA handles medium-grained retrieval while HCA handles coarse context summarization, and alternating them across layers gives you both without paying for both at full cost.
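The interleaving described above can be sketched as a toy NumPy decode step. This is a simplified illustration, not DeepSeek's implementation: mean-pooling stands in for the learned compression, a plain dot-product score stands in for the lightning indexer, and heads and learned projections are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def compress_kv(k, v, ratio):
    """Pool every `ratio` consecutive KV entries into one.
    (The real compression is learned; mean-pooling is a stand-in.)"""
    n = (len(k) // ratio) * ratio
    k = k[:n].reshape(-1, ratio, k.shape[-1]).mean(axis=1)
    v = v[:n].reshape(-1, ratio, v.shape[-1]).mean(axis=1)
    return k, v

def csa(q, k, v, ratio=4, top_k=8):
    """Compressed Sparse Attention: compress 4:1, then attend only to
    the top-k compressed entries picked by a dot-product 'indexer'."""
    ck, cv = compress_kv(k, v, ratio)
    scores = ck @ q                        # indexer: relevance per block
    idx = np.argsort(scores)[-top_k:]      # keep top-k blocks only
    w = softmax(ck[idx] @ q / np.sqrt(q.size))
    return w @ cv[idx]

def hca(q, k, v, ratio=128):
    """Heavily Compressed Attention: compress 128:1, then dense
    attention over the tiny remaining set -- no sparse selection."""
    ck, cv = compress_kv(k, v, ratio)
    w = softmax(ck @ q / np.sqrt(q.size))
    return w @ cv

def sliding_window(q, k, v, window=128):
    """Full-resolution attention over the last `window` tokens."""
    w = softmax(k[-window:] @ q / np.sqrt(q.size))
    return w @ v[-window:]

# One decode step: alternate CSA/HCA across layers, each combined
# with the local sliding-window branch.
rng = np.random.default_rng(0)
ctx, d = 4096, 64                  # tiny stand-in for a 1M-token context
k = rng.standard_normal((ctx, d))
v = rng.standard_normal((ctx, d))
q = rng.standard_normal(d)

for layer in range(4):
    branch = csa if layer % 2 == 0 else hca
    out = branch(q, k, v) + sliding_window(q, k, v)
```

Note how the cost structure falls out of the design: CSA attends over top_k blocks plus the window, HCA over ctx/128 blocks plus the window, so neither branch ever touches all 4096 (or 1M) raw entries.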
Model Specs and Performance
V4-Pro (1.6T total parameters, 49B active) ranks 23rd among human Codeforces competitors and hits 120/120 on Putnam-2025. These are competitive coding and mathematical reasoning benchmarks that indicate the model isn't sacrificing quality for efficiency.
The model is available as open weights on Hugging Face.
What This Means in Practice
Long-context inference has been a premium feature you ration — reserved for specialized use cases like document analysis or codebase-wide refactoring. V4's efficiency gains make it something you can run by default. For practitioners, this means:
- You can serve 1M-token contexts on a single A100 or H100 without custom infrastructure
- Throughput for long-context queries is high enough for production workloads
- The cost per token at long context is now comparable to standard-length inference on previous models
gentic.news Analysis
DeepSeek's trajectory is becoming one of the most interesting narratives in open-weight AI. V3.2 already demonstrated competitive performance against GPT-4 class models. Now V4 addresses the practical deployment bottleneck that has kept long-context models in the lab.
The hybrid attention design is particularly clever because it doesn't force a tradeoff between compression and retrieval quality. CSA handles the medium-range dependencies that matter for tasks like document QA, while HCA provides a cheap summary of the full context. The sliding window preserves local structure. This layered approach is likely to influence future attention architectures across the field.
The Codeforces and Putnam benchmarks are worth noting. These aren't synthetic evaluations — they're real competitive programming and math competition problems. A model that ranks 23rd among human Codeforces competitors is genuinely useful for coding tasks, not just a research curiosity.
The open-weight release is the key strategic move. By making V4 available on Hugging Face, DeepSeek ensures that the ecosystem of fine-tuning, quantization, and deployment tools will build around their architecture. This is how you win in open-source AI — not just with benchmark scores, but with practical deployability.
Frequently Asked Questions
What hardware do I need to run DeepSeek-V4 with 1M-token context?
Based on the KV cache reduction to 10% of V3.2, a single A100-80GB or H100 should be sufficient for serving 1M-token contexts with reasonable batch sizes. The exact requirements depend on your throughput needs, but the efficiency gains are designed to make this feasible on hardware that's already common in production deployments.
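A rough feasibility check of that claim, assuming a hypothetical per-token KV footprint for V3.2 (the real per-token size has not been published) and applying the reported 10% reduction:

```python
# Would a 1M-token KV cache at 10% of V3.2's footprint fit on an 80 GB GPU?
# The per-token byte count is an illustrative assumption, not a published spec.

context_tokens = 1_000_000
bytes_per_token_v32 = 70_000   # hypothetical V3.2 KV bytes per token

v32_cache_gb = context_tokens * bytes_per_token_v32 / 1e9
v4_cache_gb = v32_cache_gb * 0.10          # the reported 10% figure

print(f"V3.2 cache: {v32_cache_gb:.0f} GB, V4 cache: {v4_cache_gb:.0f} GB")
print("Fits on an 80 GB GPU:", v4_cache_gb < 80)
```

Under this assumption a cache that would have crowded out the model weights on an 80 GB card shrinks to single-digit gigabytes, which is why the single-GPU claim is plausible; actual requirements also depend on weights, activations, and batch size.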
How does DeepSeek-V4 compare to GPT-4's long-context capabilities?
DeepSeek has not published direct comparisons to GPT-4 on long-context benchmarks. However, the 1M-token context window matches or exceeds what GPT-4 offers (128K for GPT-4 Turbo, up to 1M for some variants). The key difference is that V4 is open-weight and achieves this at dramatically lower inference cost.
What is the difference between CSA and HCA in DeepSeek-V4?
CSA (Compressed Sparse Attention) compresses every 4 KV entries into one, then uses a sparse indexer to select the top-k relevant blocks. HCA (Heavily Compressed Attention) compresses every 128 tokens into one entry and uses dense attention over the tiny resulting set. CSA handles medium-range retrieval; HCA provides coarse context summarization. They are interleaved across layers.
Is DeepSeek-V4 available for commercial use?
The model is released as open weights on Hugging Face. DeepSeek's previous models have been released under permissive licenses that allow commercial use. Check the specific license on the Hugging Face model page for V4's terms.
This article is based on technical details shared by the DeepSeek team via social media. Full benchmarks and technical paper are expected to follow.