DeepSeek-V4 Brings Million-Token Context to Open Models — at a Fraction of the Cost
DeepSeek has released V4, the latest iteration of their open-weight language model family, and the headline feature is a dramatic reduction in the cost of long-context inference. According to technical details shared by the team, V4 runs 1M-token contexts using only 10% of the KV cache and 27% of the inference FLOPs compared to V3.2.
This isn't a marginal improvement — it's the kind of efficiency gain that changes what's economically viable. Previously, serving a 1M-token context required expensive hardware or severely limited throughput. V4 makes long-context inference something you can run by default on smaller machines.
What the Numbers Mean
At a glance:

- KV cache: 10% of V3.2. Serve longer contexts on smaller GPUs, or pack more concurrent users.
- Inference FLOPs: 27% of V3.2. Each generated token at 1M context is ~4x cheaper to compute.

KV cache is the memory footprint your GPU holds for every token already in context. It grows linearly with context length, and at 1M tokens it's usually what forces you onto bigger hardware or kills your throughput. Cutting it to 10% means you can serve longer contexts on smaller machines, or pack far more concurrent users on the same ones.
Inference FLOPs is the compute cost of generating the next token. With vanilla attention, each new token attends over every token already in context, so per-token cost grows linearly with context length (and processing the full prompt grows quadratically), which is why long contexts get brutally expensive. 27% means each generated token at 1M context is nearly 4x cheaper to produce.
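To make those ratios concrete, here is a quick back-of-envelope calculation using only the figures reported by the team (10% KV cache, 27% FLOPs). The absolute V3.2 cache size below is a placeholder for illustration, not a published number:

```python
# Cost comparison at 1M-token context, using the reported V4 vs. V3.2 ratios.
# The V3.2 cache size is a hypothetical placeholder, not a published figure.

v32_kv_cache_gb = 240.0        # hypothetical V3.2 KV cache at 1M tokens
kv_ratio = 0.10                # V4 uses 10% of V3.2's KV cache
flops_ratio = 0.27             # V4 uses 27% of V3.2's per-token FLOPs

v4_kv_cache_gb = v32_kv_cache_gb * kv_ratio
per_token_speedup = 1 / flops_ratio

print(f"V4 KV cache: {v4_kv_cache_gb:.0f} GB (vs {v32_kv_cache_gb:.0f} GB)")
print(f"Per-token compute advantage: {per_token_speedup:.1f}x")
```

The 1/0.27 ratio is where the "nearly 4x cheaper" figure comes from.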
How It Works: Hybrid Attention Design
The core innovation is a hybrid attention design that interleaves two mechanisms instead of picking one.
CSA (Compressed Sparse Attention) first squashes every 4 KV entries into a single compressed entry. Then it uses a lightning indexer to select the top-k most relevant compressed blocks. Compression and sparsity stacked.
HCA (Heavily Compressed Attention) is more aggressive. It compresses every 128 tokens into a single entry and skips sparse selection entirely, because at that compression ratio dense attention over the tiny remaining set is already cheap.
Both get a sliding window branch over the last 128 tokens, so local fine-grained structure isn't lost to compression.
The result is that CSA handles medium-grained retrieval while HCA handles coarse context summarization, and alternating them across layers gives you both without paying for both at full cost.
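The interleaving described above can be sketched as a toy NumPy decode step. This is a simplified illustration, not DeepSeek's implementation: mean-pooling stands in for the learned compression, a plain dot-product score stands in for the lightning indexer, and heads and learned projections are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def compress_kv(k, v, ratio):
    """Pool every `ratio` consecutive KV entries into one.
    (The real compression is learned; mean-pooling is a stand-in.)"""
    n = (len(k) // ratio) * ratio
    k = k[:n].reshape(-1, ratio, k.shape[-1]).mean(axis=1)
    v = v[:n].reshape(-1, ratio, v.shape[-1]).mean(axis=1)
    return k, v

def csa(q, k, v, ratio=4, top_k=8):
    """Compressed Sparse Attention: compress 4:1, then attend only to
    the top-k compressed entries picked by a dot-product 'indexer'."""
    ck, cv = compress_kv(k, v, ratio)
    scores = ck @ q                        # indexer: relevance per block
    idx = np.argsort(scores)[-top_k:]      # keep top-k blocks only
    w = softmax(ck[idx] @ q / np.sqrt(q.size))
    return w @ cv[idx]

def hca(q, k, v, ratio=128):
    """Heavily Compressed Attention: compress 128:1, then dense
    attention over the tiny remaining set -- no sparse selection."""
    ck, cv = compress_kv(k, v, ratio)
    w = softmax(ck @ q / np.sqrt(q.size))
    return w @ cv

def sliding_window(q, k, v, window=128):
    """Full-resolution attention over the last `window` tokens."""
    w = softmax(k[-window:] @ q / np.sqrt(q.size))
    return w @ v[-window:]

# One decode step: alternate CSA/HCA across layers, each combined
# with the local sliding-window branch.
rng = np.random.default_rng(0)
ctx, d = 4096, 64                  # tiny stand-in for a 1M-token context
k = rng.standard_normal((ctx, d))
v = rng.standard_normal((ctx, d))
q = rng.standard_normal(d)

for layer in range(4):
    branch = csa if layer % 2 == 0 else hca
    out = branch(q, k, v) + sliding_window(q, k, v)
```

Note how the cost structure falls out of the design: CSA attends over top_k blocks plus the window, HCA over ctx/128 blocks plus the window, so neither branch ever touches all 4096 (or 1M) raw entries.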
Model Specs and Performance
V4-Pro (1.6T total parameters, 49B active) ranks 23rd among human Codeforces competitors and hits 120/120 on Putnam-2025. These are competitive coding and mathematical reasoning benchmarks that indicate the model isn't sacrificing quality for efficiency.
The model is available as open weights on Hugging Face.
What This Means in Practice
Long-context inference has been a premium feature you ration — reserved for specialized use cases like document analysis or codebase-wide refactoring. V4's efficiency gains make it something you can run by default. For practitioners, this means:
- You can serve 1M-token contexts on a single A100 or H100 without custom infrastructure
- Throughput for long-context queries is high enough for production workloads
- The cost per token at long context is now comparable to standard-length inference on previous models
gentic.news Analysis
DeepSeek's trajectory is becoming one of the most interesting narratives in open-weight AI. V3.2 already demonstrated competitive performance against GPT-4 class models. Now V4 addresses the practical deployment bottleneck that has kept long-context models in the lab.
The hybrid attention design is particularly clever because it doesn't force a tradeoff between compression and retrieval quality. CSA handles the medium-range dependencies that matter for tasks like document QA, while HCA provides a cheap summary of the full context. The sliding window preserves local structure. This layered approach is likely to influence future attention architectures across the field.
The Codeforces and Putnam benchmarks are worth noting. These aren't synthetic evaluations — they're real competitive programming and math competition problems. A model that ranks 23rd among human Codeforces competitors is genuinely useful for coding tasks, not just a research curiosity.
The open-weight release is the key strategic move. By making V4 available on Hugging Face, DeepSeek ensures that the ecosystem of fine-tuning, quantization, and deployment tools will build around their architecture. This is how you win in open-source AI — not just with benchmark scores, but with practical deployability.
Frequently Asked Questions
What hardware do I need to run DeepSeek-V4 with 1M-token context?
Based on the KV cache reduction to 10% of V3.2, a single A100-80GB or H100 should be sufficient for serving 1M-token contexts with reasonable batch sizes. The exact requirements depend on your throughput needs, but the efficiency gains are designed to make this feasible on hardware that's already common in production deployments.
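A rough feasibility check of that claim, assuming a hypothetical per-token KV footprint for V3.2 (the real per-token size has not been published) and applying the reported 10% reduction:

```python
# Would a 1M-token KV cache at 10% of V3.2's footprint fit on an 80 GB GPU?
# The per-token byte count is an illustrative assumption, not a published spec.

context_tokens = 1_000_000
bytes_per_token_v32 = 70_000   # hypothetical V3.2 KV bytes per token

v32_cache_gb = context_tokens * bytes_per_token_v32 / 1e9
v4_cache_gb = v32_cache_gb * 0.10          # the reported 10% figure

print(f"V3.2 cache: {v32_cache_gb:.0f} GB, V4 cache: {v4_cache_gb:.0f} GB")
print("Fits on an 80 GB GPU:", v4_cache_gb < 80)
```

Under this assumption a cache that would have crowded out the model weights on an 80 GB card shrinks to single-digit gigabytes, which is why the single-GPU claim is plausible; actual requirements also depend on weights, activations, and batch size.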
How does DeepSeek-V4 compare to GPT-4's long-context capabilities?
DeepSeek has not published direct comparisons to GPT-4 on long-context benchmarks. However, the 1M-token context window matches or exceeds what GPT-4 offers (128K for GPT-4 Turbo, up to 1M for some variants). The key difference is that V4 is open-weight and achieves this at dramatically lower inference cost.
What is the difference between CSA and HCA in DeepSeek-V4?
CSA (Compressed Sparse Attention) compresses every 4 KV entries into one, then uses a sparse indexer to select the top-k relevant blocks. HCA (Heavily Compressed Attention) compresses every 128 tokens into one entry and uses dense attention over the tiny resulting set. CSA handles medium-range retrieval; HCA provides coarse context summarization. They are interleaved across layers.
Is DeepSeek-V4 available for commercial use?
The model is released as open weights on Hugging Face. DeepSeek's previous models have been released under permissive licenses that allow commercial use. Check the specific license on the Hugging Face model page for V4's terms.
This article is based on technical details shared by the DeepSeek team via social media. Full benchmarks and technical paper are expected to follow.