NVIDIA's Blackwell inference stack cut DeepSeek V4 token costs by up to 5x in one month, according to a newly published NVIDIA report shared by @rohanpaul_ai on X.
Key facts
- Up to 5x reduction in DeepSeek V4 token costs in one month
- NVIDIA report claims Blackwell inference stack as the cause
- DeepSeek V4 has 1.5 trillion parameters, 370B active per token
- Prior estimated inference cost: $0.50 per million tokens on H100
- Report shared via @rohanpaul_ai on X, not peer-reviewed
The claim, from an NVIDIA report shared by @rohanpaul_ai on X, positions Blackwell as a significant leap in inference efficiency for large language models. The reduction of up to 5x applies to DeepSeek V4, a model released in early 2025 and noted for performance competitive with frontier models from OpenAI and Anthropic.
NVIDIA has not publicly detailed the specific optimizations, which could involve FP4 quantization, speculative decoding, or improved tensor core utilization. The one-month timeline suggests rapid engineering iteration rather than a fundamental architecture change. The report likely compares token costs on Blackwell B200 or B300 GPUs against earlier Hopper H100 deployments.
This result, if independently verified, would challenge the prevailing narrative that inference costs are plateauing. DeepSeek V4, with its 1.5 trillion parameters and Mixture-of-Experts architecture, is notoriously expensive to serve; a 5x reduction could make it viable for real-time applications at scale.
Context and Caveats

DeepSeek V4, released in February 2025, uses a MoE architecture with 370 billion active parameters per token. Prior reports estimated its inference cost at roughly $0.50 per million tokens on H100 clusters. A full 5x reduction would bring that to $0.10 per million tokens, competitive with GPT-4o-mini pricing; because the report says "up to 5x", that figure is a best case rather than a typical one.
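The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the $0.50 baseline comes from the estimate quoted above, while the function name `cost_after_reduction` and the intermediate factors of 2x and 3x are assumptions added here to show what "up to 5x" could mean short of the best case.

```python
def cost_after_reduction(baseline_usd_per_million: float, reduction_factor: float) -> float:
    """Return the per-million-token cost after dividing the baseline by a reduction factor."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return baseline_usd_per_million / reduction_factor


BASELINE_USD_PER_MILLION = 0.50  # prior H100 estimate quoted in the article

# "Up to 5x" is an upper bound, so a few smaller factors are shown too (illustrative values).
for factor in (2, 3, 5):
    cost = cost_after_reduction(BASELINE_USD_PER_MILLION, factor)
    saving_pct = (1 - 1 / factor) * 100
    print(f"{factor}x reduction: ${cost:.3f} per million tokens ({saving_pct:.0f}% lower)")
```

Only the 5x row corresponds to the $0.10 figure; a smaller real-world factor would leave the cost proportionally higher.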
However, NVIDIA's report is a vendor's internal benchmark, not a peer-reviewed study. The company did not disclose the test methodology, hardware count, or whether the cost includes electricity, cooling, or amortized hardware. Independent validation from cloud providers like CoreWeave or Lambda Labs would strengthen the claim.
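To see why the undisclosed cost components matter, here is a rough sketch of how a per-million-token cost could be assembled from amortized hardware and energy. Nothing here comes from NVIDIA's report: the class `ServingCostInputs`, the function `usd_per_million_tokens`, and every number in the example are hypothetical placeholders, and cooling is approximated through a power usage effectiveness (PUE) multiplier.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class ServingCostInputs:
    """Hypothetical per-GPU inputs; none of these values come from the report."""
    gpu_price_usd: float              # purchase price of one GPU
    amortization_years: float         # period over which the price is written off
    gpu_power_kw: float               # average power draw of one GPU
    electricity_usd_per_kwh: float    # electricity price
    pue: float                        # power usage effectiveness (covers cooling overhead)
    tokens_per_second_per_gpu: float  # sustained serving throughput


def usd_per_million_tokens(c: ServingCostInputs,
                           include_hardware: bool = True,
                           include_energy: bool = True) -> float:
    """Cost per million tokens, optionally leaving out hardware or energy."""
    hourly_usd = 0.0
    if include_hardware:
        hourly_usd += c.gpu_price_usd / (c.amortization_years * HOURS_PER_YEAR)
    if include_energy:
        hourly_usd += c.gpu_power_kw * c.pue * c.electricity_usd_per_kwh
    tokens_per_hour = c.tokens_per_second_per_gpu * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# Placeholder inputs chosen only to make the example run.
example = ServingCostInputs(
    gpu_price_usd=30_000, amortization_years=4, gpu_power_kw=1.0,
    electricity_usd_per_kwh=0.10, pue=1.3, tokens_per_second_per_gpu=1_000,
)
print(f"energy only:   ${usd_per_million_tokens(example, include_hardware=False):.3f}")
print(f"hardware only: ${usd_per_million_tokens(example, include_energy=False):.3f}")
print(f"all-in:        ${usd_per_million_tokens(example):.3f}")
```

Depending on which terms a benchmark includes, the same deployment can yield quite different headline numbers, which is why the missing methodology limits how far the 5x claim can be interpreted.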
Strategic Implications

The timing is notable. DeepSeek V4 has gained traction among cost-sensitive enterprises, and a 5x inference cost reduction from NVIDIA's latest silicon could accelerate adoption. It also pressures AMD and Intel, whose MI400 and Gaudi 3 chips are targeting similar inference workloads.
NVIDIA's move mirrors a broader trend: as model sizes grow, inference optimization becomes the key differentiator for hardware vendors. The company's dominance in training (a market share often cited as 95% or more) is now being reinforced in inference, where software optimizations like TensorRT-LLM and Blackwell's hardware features create a moat.
What to watch
Watch for independent validation from cloud GPU providers like CoreWeave or Lambda Labs running Blackwell clusters with DeepSeek V4. Also track NVIDIA's Q3 earnings call for any mention of inference revenue share versus training.