gentic.news — AI News Intelligence Platform

NVIDIA Blackwell Sweeps MLPerf Training 6.0, GB300 Hits 1.6x Speedup

NVIDIA Blackwell swept MLPerf Training 6.0 across all seven benchmarks. The GB300 NVL72 delivered up to 1.6x faster training than the GB200 NVL72 using NVFP4, and the DeepSeek-V3 671B submission ran on 8,192 GPUs.

6h ago · 3 min read · AI-Generated
Source: blogs.nvidia.com via nvidia_dc_blog, gn_gpu_cluster (corroborated)
Did NVIDIA Blackwell win MLPerf Training 6.0?

The NVIDIA Blackwell platform swept MLPerf Training 6.0, winning all seven benchmarks. The GB300 NVL72 delivered up to 1.6x faster training than the GB200 NVL72 using NVFP4 precision, and NVIDIA trained DeepSeek-V3 671B on 8,192 GPUs.

TL;DR

NVIDIA Blackwell won all 7 MLPerf Training 6.0 benchmarks. · GB300 NVL72 delivered up to 1.6x faster training than GB200. · DeepSeek-V3 671B trained on 8,192 GPUs via NVLink.

NVIDIA Blackwell swept all seven benchmarks in MLPerf Training 6.0. The GB300 NVL72 delivered up to 1.6x faster training than the GB200 NVL72 at the same scale, helped by NVFP4 precision, and NVIDIA's largest submission used 8,192 GPUs.

Key facts

  • NVIDIA won all 7 benchmarks in MLPerf Training 6.0.
  • GB300 NVL72 achieved up to 1.6x faster training than GB200 NVL72.
  • DeepSeek-V3 671B trained on 8,192 GPUs via NVLink.
  • New MoE workloads DeepSeek-V3 671B and GPT-OSS-20B added.
  • NVFP4 precision used for Nemotron 3 Ultra 550B-parameter model.
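
To make the "up to 1.6x faster" figure concrete, here is a minimal arithmetic sketch in Python. The 80-minute baseline is a hypothetical number chosen for illustration; the article does not report absolute training times, and the function names are ours.

```python
def time_to_train(baseline_minutes: float, speedup: float) -> float:
    """Wall-clock time after applying a speedup factor to a baseline run."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_minutes / speedup


def time_saved_fraction(speedup: float) -> float:
    """Fraction of the baseline wall-clock time that a speedup removes."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Hypothetical baseline, NOT a figure from the MLPerf results.
baseline_minutes = 80.0
print(time_to_train(baseline_minutes, 1.6))
print(time_saved_fraction(1.6))
```

Read this way, a 1.6x speedup removes a little over a third of the baseline time to train, which is a smaller-sounding number than "1.6x" but describes the same result.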

NVIDIA's Blackwell platform dominated MLPerf Training 6.0, the latest peer-reviewed industry benchmark suite for AI training performance, according to NVIDIA's blog post. The platform achieved the fastest time to train on every benchmark, including two new mixture-of-experts (MoE) pretraining workloads: DeepSeek-V3 671B and GPT-OSS-20B. NVIDIA was the only platform with submissions across all seven benchmarks in the suite.

The standout result came from the GB300 NVL72 rack-scale system, which delivered up to 1.6x faster training than the GB200 NVL72 at the same scale. Key Blackwell Ultra capabilities driving this improvement include higher compute density with NVFP4 precision, expanded memory capacity, and a higher power ceiling that lets the GPU sustain peak performance. NVIDIA also showcased NVFP4 training methods that increase performance while meeting strict accuracy requirements across large- and small-scale pretraining as well as fine-tuning workloads.
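
The article does not describe how NVFP4 works, so the following Python sketch is only an illustration of the general idea behind block-scaled 4-bit floating point. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a block size of 16, which match NVIDIA's public descriptions of the format as far as we know. It keeps the per-block scale as an ordinary Python float, whereas the real format stores scales in reduced precision and adds a tensor-level scale. It is not NVIDIA's training recipe.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block size; each block shares one scale factor


def quantize_block(block):
    """Map one block to (scale, codes), where each code is a signed E2M1 value."""
    max_abs = max(abs(x) for x in block)
    # Scale so the largest magnitude in the block lands on the top grid value.
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


weights = [0.02, -0.31, 0.75, 1.2, -2.4, 0.0, 0.11, 3.3]
approx = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, approx))
print(approx)
print("largest absolute error:", worst)
```

The sketch shows why the per-block scale matters: with only eight magnitudes available, accuracy depends on fitting that grid to the local range of the data, which is presumably part of what the "strict accuracy requirements" mentioned above constrain.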

MoE Training at Scale

Large-scale MoE training faces the same all-to-all communication challenge as MoE inference: tokens must be routed across GPUs to reach the right expert subnetwork. NVIDIA's fifth-generation NVLink Switches connect all 72 GPUs within each rack-scale system with high bandwidth into a unified pool of compute and memory, enabling them to act as one giant GPU. According to NVIDIA, this NVLink bandwidth advantage is what makes MoE training fast and efficient at scale.
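
The all-to-all pattern described above can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the router uses random scores in place of a learned gate, tokens and experts are split evenly across GPUs, and the sizes are tiny. None of it reflects the actual MLPerf submissions.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Pick top_k experts per token from random scores (stand-in for a learned gate)."""
    rng = random.Random(seed)
    routes = []
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(num_experts)]
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        routes.append(ranked[:top_k])
    return routes


def all_to_all_traffic(routes, num_gpus, num_experts):
    """Count token copies sent from each source GPU to each destination GPU."""
    num_tokens = len(routes)
    if num_tokens % num_gpus or num_experts % num_gpus:
        raise ValueError("tokens and experts must divide evenly across GPUs")
    tokens_per_gpu = num_tokens // num_gpus
    experts_per_gpu = num_experts // num_gpus
    traffic = Counter()
    for token_id, experts in enumerate(routes):
        src = token_id // tokens_per_gpu      # GPU that holds the token
        for expert in experts:
            dst = expert // experts_per_gpu   # GPU that hosts the expert
            traffic[(src, dst)] += 1
    return traffic


def remote_fraction(traffic):
    """Share of token copies whose expert lives on a different GPU."""
    total = sum(traffic.values())
    remote = sum(n for (src, dst), n in traffic.items() if src != dst)
    return remote / total


routes = route_tokens(num_tokens=64, num_experts=8, top_k=2)
traffic = all_to_all_traffic(routes, num_gpus=4, num_experts=8)
print("fraction of token copies crossing GPUs:", remote_fraction(traffic))
```

With roughly uniform routing over G GPUs, about (G-1)/G of token copies leave their source GPU in expectation, so nearly every token crosses the interconnect once the expert pool is spread widely. That is the traffic the NVLink domain is meant to absorb.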

To support distributed training at scale, NVIDIA offers two complementary scale-out networking platforms — NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet — giving data centers flexibility to build large-scale clusters optimized for their infrastructure. On DeepSeek-V3 671B, NVIDIA submitted results using 8,192 GPUs, the largest Blackwell cluster in MLPerf Training history.

Historical Context and Competition

This sweep comes as NVIDIA faces increasing competition from custom silicon and alternative architectures. Google's TPU v6, AMD's MI400, and Cerebras CS-3 have all posted competitive results in previous MLPerf rounds. However, NVIDIA's ability to deliver both the fastest single-system performance and the largest-scale distributed training results — while being the only vendor to submit across all benchmarks — reinforces its dominant position in AI training infrastructure.

The GB300 NVL72's 1.6x speedup over the GB200 NVL72 is notable given how recent the Blackwell architecture is: NVIDIA announced it in March 2024. This rapid improvement within a single generation suggests NVIDIA's engineering cadence remains aggressive, consistent with the one-year product rhythm Jensen Huang has described publicly.

What to Watch

Watch for the MLPerf Inference 7.0 results expected in Q4 2026, where NVIDIA will face pressure from AMD's MI400 and Google's TPU v6 on latency-sensitive workloads. Also monitor whether CoreWeave or other cloud providers can replicate NVIDIA's 8,192-GPU DeepSeek-V3 training result on their own clusters, which would validate the scalability claims independently.


Key Takeaways

  • NVIDIA Blackwell swept MLPerf Training 6.0 across all seven benchmarks.
  • GB300 NVL72 delivered up to 1.6x faster training than GB200 NVL72 using NVFP4; the DeepSeek-V3 671B run used 8,192 GPUs.

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

NVIDIA's MLPerf sweep is less about raw performance and more about ecosystem lock-in. The 1.6x GB300 speedup over GB200 is impressive, but the real story is that NVIDIA was the only vendor to submit across all seven benchmarks. This breadth matters because MLPerf submissions require significant engineering effort: each benchmark demands custom kernel tuning, distributed strategy optimization, and validation. Competitors like AMD and Google cherry-pick workloads where their architectures shine, while NVIDIA demonstrates that its platform handles the full spectrum of training workloads, from small fine-tuning to massive MoE pretraining.

The NVFP4 precision story is particularly interesting. By reducing precision to 4-bit floating point while maintaining accuracy, NVIDIA effectively increases throughput without requiring larger clusters. This is a direct response to a compute-constrained environment in which frontier-scale models demand thousands of GPUs per training run. If NVFP4 can be generalized to other architectures, it could reshape training economics.

However, the lack of independent validation is a gap. NVIDIA's results come from its own labs and cloud partners. Third-party verification from CoreWeave or Google Cloud would strengthen the claims. The DeepSeek-V3 671B result on 8,192 GPUs is the most impressive technical achievement here: MoE training at that scale requires solving complex routing and load-balancing problems that most vendors avoid.