NVIDIA B200 8-GPU machines, clustered over RoCEv2 Ethernet with ConnectX-7 NICs and Tomahawk switches, achieve up to 7x higher per-GPU token throughput. The gain comes from PD disaggregation, a technique that separates the prefill and decode phases of inference, and it cuts cost per million tokens by up to 7x.
Key facts
- 7x per-GPU token throughput increase with PD disaggregation
- 7x cost reduction per million tokens
- Uses RoCEv2 Ethernet with ConnectX-7 NICs and Tomahawk switches
- Built by inferact, vLLM, and NVIDIA Dynamo
- First public B200 PD disaggregation benchmark
By clustering multiple B200 8-GPU nodes over RoCEv2 Ethernet with ConnectX-7 NICs and Tomahawk switches, and applying an inference optimization called PD disaggregation, per-GPU token throughput increases up to 7x, according to @SemiAnalysis_. Since the hardware cost is unchanged, this directly reduces cost per million tokens by up to 7x.
The work is credited to @inferact and the vLLM open-source project for the inference engine, and to @NVIDIADC and @KranenKyle for the Dynamo inference orchestrator. PD disaggregation separates the prefill and decode phases of transformer inference across different GPUs: prefill is compute-bound (processing the full prompt at once) while decode is memory-bandwidth-bound (emitting one token per step), so each phase can be scheduled and batched on hardware sized for its profile. The technique was previously shown to improve utilization in large-scale deployments, but is now demonstrated on B200 clusters with commodity Ethernet networking.
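To make the idea concrete, here is a minimal routing sketch. This is not inferact's, vLLM's, or Dynamo's actual implementation; the pool sizes, names, and `route_request` helper are illustrative assumptions, and the real systems also transfer the KV cache produced by prefill to the decode GPUs (over RoCEv2 in this benchmark), which is omitted here.

```python
from dataclasses import dataclass, field

@dataclass
class GPUPool:
    """A set of GPUs dedicated to one inference phase (illustrative)."""
    name: str
    gpu_ids: list
    queue: list = field(default_factory=list)

    def submit(self, request_id: str) -> None:
        self.queue.append(request_id)

def route_request(request_id: str, phase: str,
                  prefill_pool: GPUPool, decode_pool: GPUPool) -> str:
    """Send a request to the pool handling its current phase.

    Prefill (compute-bound) and decode (memory-bandwidth-bound) run on
    separate GPU pools, so each can be batched for its own profile.
    """
    pool = prefill_pool if phase == "prefill" else decode_pool
    pool.submit(request_id)
    return pool.name

# Hypothetical split of one 8-GPU node: 2 prefill GPUs, 6 decode GPUs.
prefill = GPUPool("prefill", gpu_ids=[0, 1])
decode = GPUPool("decode", gpu_ids=[2, 3, 4, 5, 6, 7])
route_request("req-1", "prefill", prefill, decode)  # -> "prefill"
route_request("req-1", "decode", prefill, decode)   # -> "decode"
```

The design point the sketch highlights is the scheduling boundary: once the two phases live in separate pools, each pool's batch size and GPU count can be tuned independently instead of compromising between the two workloads on shared GPUs.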
Why PD Disaggregation Matters Now
The unique take: PD disaggregation has been discussed in research (e.g., Patel et al. 2024) but this is the first public benchmark showing it on B200 hardware with RoCEv2 Ethernet rather than NVLink or InfiniBand. The use of Tomahawk switches and CX-7 NICs suggests the technique works on merchant silicon, not just NVIDIA's proprietary fabric. This lowers the barrier for operators building inference clusters without full NVLink backplanes.
The Cost Implication
If a B200 8-GPU node costs roughly $300,000, a 7x throughput improvement drops per-token cost from an illustrative ~$3 per million tokens to ~$0.43 per million tokens, assuming linear scaling. @SemiAnalysis_ did not disclose exact dollar figures, but the proportional claim is unambiguous: throughput up 7x at fixed hardware cost means cost per token down 7x.
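The arithmetic behind that claim is a single division. The ~$3-per-million-tokens baseline is an illustrative assumption (not a disclosed figure); only the 7x ratio comes from the source.

```python
baseline_cost = 3.00   # assumed baseline, $ per million tokens (illustrative)
speedup = 7            # claimed per-GPU throughput gain from PD disaggregation

# Fixed hardware cost amortized over 7x more tokens -> per-token cost / 7.
new_cost = baseline_cost / speedup
print(f"${new_cost:.2f} per million tokens")  # -> $0.43 per million tokens
```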
What's Next
@SemiAnalysis_ signals "more improvements to disagg b200 perf to come," suggesting the 7x figure is not the ceiling and optimization work is ongoing.
What to watch
Watch for NVIDIA GTC 2026 in March for official Dynamo updates and B200 PD disaggregation benchmarks. Also track vLLM releases for merged PD disaggregation support — the open-source engine's adoption could accelerate inference cluster cost reductions.