Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

NVIDIA B200 8-GPU servers connected via RoCEv2 CX-7 Ethernet and Tomahawk switches, with a diagram showing prefill…

B200 PD Disaggregation Boosts Token Throughput 7x, Slashes Cost

B200 clusters with PD disaggregation over RoCEv2 Ethernet achieve 7x token throughput, cutting cost per million tokens 7x.

AAAla SMITH & AI Research Desk·4h ago·2 min read··5 views·AI-Generated·Report error

Source: x.comvia @SemiAnalysis_Single Source

How does PD disaggregation improve B200 inference throughput and cost?

NVIDIA B200 8-GPU machines ganged over RoCEv2 CX-7 Ethernet with Tomahawk switches and PD disaggregation yield up to 7x per-GPU token throughput, cutting cost per million tokens 7x. Built by inferact and vLLM with NVIDIA Dynamo.

TL;DR

B200 8-GPU machines with RoCEv2 CX-7 Ethernet · PD disaggregation increases per-GPU throughput up to 7x · Cost per million tokens drops up to 7x

NVIDIA B200 8-GPU machines ganged over RoCEv2 CX-7 Ethernet with Tomahawk switches achieve up to 7x per-GPU token throughput. PD disaggregation, a technique separating prefill and decode, drives the gain and cuts cost per million tokens by 7x.

Key facts

7x per-GPU token throughput increase with PD disaggregation
7x cost reduction per million tokens
Uses RoCEv2 CX-7 Ethernet and Tomahawk switches
Built by inferact, vLLM, NVIDIA Dynamo
First public B200 PD disaggregation benchmark

By clustering multiple B200 8-GPU nodes over RoCEv2 CX-7 Ethernet using Tomahawk switches and applying an inference optimization called PD disaggregation, per-GPU token throughput increases up to 7x, according to @SemiAnalysis_. This directly reduces cost per million tokens by up to 7x.

The work is credited to @inferact and the vLLM open-source project for the inference engine, and to @NVIDIADC and @KranenKyle for the Dynamo inference orchestrator. PD disaggregation separates the prefill and decode phases of transformer inference across different GPUs, a technique previously shown to improve utilization in large-scale deployments but now demonstrated on B200 clusters with commodity Ethernet networking.

Why PD Disaggregation Matters Now

The unique take: PD disaggregation has been discussed in research (e.g., Patel et al. 2024) but this is the first public benchmark showing it on B200 hardware with RoCEv2 Ethernet rather than NVLink or InfiniBand. The use of Tomahawk switches and CX-7 NICs suggests the technique works on merchant silicon, not just NVIDIA's proprietary fabric. This lowers the barrier for operators building inference clusters without full NVLink backplanes.

The Cost Implication

If a B200 8-GPU node currently costs roughly $300,000, a 7x throughput improvement effectively drops per-token cost from ~$3 per million tokens to ~$0.43 per million tokens, assuming linear scaling. The exact figures were not disclosed by @SemiAnalysis_, but the proportional claim is unambiguous.

What's Next

@SemiAnalysis_ signals "more improvements to disagg b200 perf to come," suggesting ongoing optimization work. Watch for NVIDIA's GTC 2026 in March for potential Dynamo updates or official B200 PD disaggregation benchmarks.

What to watch

Watch for NVIDIA GTC 2026 in March for official Dynamo updates and B200 PD disaggregation benchmarks. Also track vLLM releases for merged PD disaggregation support — the open-source engine's adoption could accelerate inference cluster cost reductions.

Source: gentic.news · 4h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

This is a rare public quantification of PD disaggregation on B200 hardware, and the use of RoCEv2 Ethernet rather than NVLink is notable. Most disaggregation work has assumed InfiniBand or NVLink; demonstrating 7x gains on merchant silicon suggests the technique is more portable than previously assumed. However, the 7x figure may be peak or synthetic — real-world latency-sensitive workloads may see lower gains. The involvement of vLLM, an open-source engine, means these gains are accessible to any operator, not just NVIDIA's cloud partners. The Dynamo orchestrator from NVIDIA DC adds a proprietary layer, but the engine itself is open. The claim needs independent verification; @SemiAnalysis_ has been accurate on hardware performance in the past but this is a single source.

#open-source #hardware #nvidia #inference

Compare side-by-side

B200 vs vLLM

→

Mentioned in this article

B200 PD disaggregation vLLM NVIDIA Dynamo inferact

Enjoyed this article?

Get the weekly AI intelligence briefing

✨AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

AI Research

Anthropic Teaches Claude Why: New Interpretability Method Deployed

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

Original research · EUMAS 2026

MNEMA — A Witness Lattice for Multi-Agent AI Memory

Cryptographic memory units · 1−α detection floor · 15 pp PDF

Field framework · v1.0

Epistemic Infrastructure

12 pillars · 11-stage knowledge metabolism · pathology catalog

More in AI Research

View all

Satellite image of patchwork agricultural fields in various shades of green and brown, with geometric boundaries…

AI Research

Prithvi-EO Fails Cross-Country Crop Yield Generalization, Paper Shows

Prithvi-EO and ViT-Base embeddings yield universally negative R² under cross-country maize yield prediction, failing to beat traditional spectral features due to yield distribution shift.

arxiv.org/17h ago/3 min read

earth-observationfoundation-modelsarxiv

A researcher analyzes a diagram of a neural network with highlighted connections being removed, representing LLM…

AI Research

Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage

Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.

arxiv.org/17h ago/3 min read

ai safetymodel compressionedge ai

A college student wearing a 64-channel EEG cap with multiple electrodes on their head, seated in front of a computer…

AI Research

TikTok Brain Has an EEG Signature: Frontal Theta Drops 0.395

Zhejiang University EEG study finds 0.395 correlation between short-video addiction and suppressed frontal-lobe theta waves during attention tasks, indicating algorithmic engagement optimization dampens executive control.

x.com/1d ago/3 min read

social-media-effectsrecommendation-systemsattention

Why PD Disaggregation Matters Now

The Cost Implication

What's Next

What to watch

AI Analysis

✨AI Toolslive

Related Articles

RRCM Uses GRPO to Decide When to Retrieve for LLM Recommendation

Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks

Claude Code's Six-Layer Architecture: Harness, Not Magic

MCP vs CLI Debate Resolved by Anthropic's Code Mode: 98.7% Token Drop

Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?

Anthropic Teaches Claude Why: New Interpretability Method Deployed

The framework underneath this story

More in AI Research

Prithvi-EO Fails Cross-Country Crop Yield Generalization, Paper Shows

Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage

TikTok Brain Has an EEG Signature: Frontal Theta Drops 0.395