Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

NVIDIA B200 8-GPU servers connected via RoCEv2 CX-7 Ethernet and Tomahawk switches, with a diagram showing prefill…
AI ResearchScore: 85

B200 PD Disaggregation Boosts Token Throughput 7x, Slashes Cost

B200 clusters with PD disaggregation over RoCEv2 Ethernet achieve 7x token throughput, cutting cost per million tokens 7x.

·4h ago·2 min read··5 views·AI-Generated·Report error
Share:
How does PD disaggregation improve B200 inference throughput and cost?

NVIDIA B200 8-GPU machines ganged over RoCEv2 CX-7 Ethernet with Tomahawk switches and PD disaggregation yield up to 7x per-GPU token throughput, cutting cost per million tokens 7x. Built by inferact and vLLM with NVIDIA Dynamo.

TL;DR

B200 8-GPU machines with RoCEv2 CX-7 Ethernet · PD disaggregation increases per-GPU throughput up to 7x · Cost per million tokens drops up to 7x

NVIDIA B200 8-GPU machines ganged over RoCEv2 CX-7 Ethernet with Tomahawk switches achieve up to 7x per-GPU token throughput. PD disaggregation, a technique separating prefill and decode, drives the gain and cuts cost per million tokens by 7x.

Key facts

  • 7x per-GPU token throughput increase with PD disaggregation
  • 7x cost reduction per million tokens
  • Uses RoCEv2 CX-7 Ethernet and Tomahawk switches
  • Built by inferact, vLLM, NVIDIA Dynamo
  • First public B200 PD disaggregation benchmark

By clustering multiple B200 8-GPU nodes over RoCEv2 CX-7 Ethernet using Tomahawk switches and applying an inference optimization called PD disaggregation, per-GPU token throughput increases up to 7x, according to @SemiAnalysis_. This directly reduces cost per million tokens by up to 7x.

The work is credited to @inferact and the vLLM open-source project for the inference engine, and to @NVIDIADC and @KranenKyle for the Dynamo inference orchestrator. PD disaggregation separates the prefill and decode phases of transformer inference across different GPUs, a technique previously shown to improve utilization in large-scale deployments but now demonstrated on B200 clusters with commodity Ethernet networking.

Why PD Disaggregation Matters Now

The unique take: PD disaggregation has been discussed in research (e.g., Patel et al. 2024) but this is the first public benchmark showing it on B200 hardware with RoCEv2 Ethernet rather than NVLink or InfiniBand. The use of Tomahawk switches and CX-7 NICs suggests the technique works on merchant silicon, not just NVIDIA's proprietary fabric. This lowers the barrier for operators building inference clusters without full NVLink backplanes.

The Cost Implication

If a B200 8-GPU node currently costs roughly $300,000, a 7x throughput improvement effectively drops per-token cost from ~$3 per million tokens to ~$0.43 per million tokens, assuming linear scaling. The exact figures were not disclosed by @SemiAnalysis_, but the proportional claim is unambiguous.

What's Next

@SemiAnalysis_ signals "more improvements to disagg b200 perf to come," suggesting ongoing optimization work. Watch for NVIDIA's GTC 2026 in March for potential Dynamo updates or official B200 PD disaggregation benchmarks.

What to watch

Watch for NVIDIA GTC 2026 in March for official Dynamo updates and B200 PD disaggregation benchmarks. Also track vLLM releases for merged PD disaggregation support — the open-source engine's adoption could accelerate inference cluster cost reductions.

Source: gentic.news · · author= · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

This is a rare public quantification of PD disaggregation on B200 hardware, and the use of RoCEv2 Ethernet rather than NVLink is notable. Most disaggregation work has assumed InfiniBand or NVLink; demonstrating 7x gains on merchant silicon suggests the technique is more portable than previously assumed. However, the 7x figure may be peak or synthetic — real-world latency-sensitive workloads may see lower gains. The involvement of vLLM, an open-source engine, means these gains are accessible to any operator, not just NVIDIA's cloud partners. The Dynamo orchestrator from NVIDIA DC adds a proprietary layer, but the engine itself is open. The claim needs independent verification; @SemiAnalysis_ has been accurate on hardware performance in the past but this is a single source.
Compare side-by-side
B200 vs vLLM
Enjoyed this article?
Share:

AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

Related Articles

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

More in AI Research

View all