gentic.news — AI News Intelligence Platform
NVIDIA NVFP4 on Blackwell Speeds Up JAX Training in MaxText by Up to 1.8x

NVIDIA NVFP4 on Blackwell achieves 1.8x training speedup over FP8 in JAX/MaxText with no claimed accuracy loss for models up to 70B, but larger-scale validation is needed.

Source: news.google.com via gn_gpu_cluster (single source)
How much faster does NVFP4 train models on NVIDIA Blackwell compared to FP8?

NVIDIA introduced NVFP4 on Blackwell GPUs, achieving up to 1.8x training speedup over FP8 in JAX/MaxText with no accuracy loss, per the company's June 2026 technical blog.

TL;DR

NVFP4 delivers 1.8x training speedup over FP8 on Blackwell. · MaxText with JAX now supports FP4 natively. · NVIDIA claims no accuracy loss vs FP8 for large models.

NVIDIA's NVFP4 4-bit format on Blackwell GPUs delivers up to 1.8x training speedup over FP8 in JAX/MaxText. The company claims no accuracy loss versus FP8 for models up to 70B parameters.

Key facts

  • NVFP4 delivers 1.8x training speedup over FP8 on Blackwell.
  • Format packs two 4-bit values into a single byte.
  • No accuracy loss claimed for models up to 70B parameters.
  • MaxText now includes native FP4 support.
  • Blackwell has dedicated FP4 tensor cores, absent in H100.

NVIDIA announced NVFP4, a 4-bit floating-point format for Blackwell GPUs, now integrated into MaxText, Google's LLM training library built on JAX. According to the NVIDIA Technical Blog, the format packs two 4-bit values into a single byte, doubling the number of values stored or moved per byte relative to FP8, and preserves dynamic range through a shared per-block scaling scheme. NVIDIA reports the up-to-1.8x throughput improvement from a GPT-3 175B training benchmark, while its claim of no accuracy degradation covers only models up to 70B parameters. The company did not disclose accuracy results for larger models or provide full ablation tables.
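As a rough illustration of the packing and shared-scale idea described above, the following Python sketch quantizes one block of values to 4-bit E2M1 codes (1 sign bit, 2 exponent bits, 1 mantissa bit) and stores two codes per byte. The 16-value block size, the plain Python float used as the block scale, and all function names are assumptions made for illustration; this is not NVIDIA's or MaxText's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by the low 3 bits.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of values sharing one scale


def encode_fp4(x: float) -> int:
    """Return the 4-bit code (sign bit + 3-bit magnitude index) nearest to x.

    Ties resolve to the smaller magnitude, which is simpler than the
    round-to-nearest-even rule real hardware would typically use.
    """
    sign = 0x8 if x < 0 else 0x0
    mag = abs(x)
    index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | index


def decode_fp4(code: int) -> float:
    """Map a 4-bit code back to its real value."""
    value = E2M1_MAGNITUDES[code & 0x7]
    return -value if code & 0x8 else value


def quantize_block(values):
    """Quantize one block and return (scale, packed_bytes)."""
    amax = max(abs(v) for v in values)
    # Scale so the largest magnitude in the block lands on 6.0, the FP4 maximum.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [encode_fp4(v / scale) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up
    # Two 4-bit codes per byte: even index in the low nibble, odd in the high.
    packed = bytes((codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2))
    return scale, packed


def dequantize_block(scale, packed, n):
    """Unpack n values from packed bytes and undo the block scale."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [decode_fp4(c) * scale for c in codes[:n]]
```

Under these assumptions, a block of 16 values occupies 8 bytes of packed codes plus one scale, which is where the density gain over FP8 comes from.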

Why FP4 Matters Now

The timing aligns with NVIDIA's broader push into lower-precision training as model sizes cross trillion-parameter thresholds. Blackwell's architecture includes dedicated FP4 tensor cores, a hardware feature absent from Hopper (H100) GPUs. This gives Blackwell a concrete advantage for pre-training and fine-tuning workloads where memory bandwidth is the bottleneck: at 4 bits per value, the per-parameter footprint is 4x smaller than FP16 and 2x smaller than FP8, before counting the shared scale factors. For a 175B model, that means roughly 87.5 GB of weights at FP4 versus 350 GB at FP16, a saving of about 262 GB, potentially enabling larger batch sizes or reduced pipeline parallelism.
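The footprint arithmetic for the 175B example can be checked with a few lines of Python. This is a back-of-the-envelope sketch that counts weights only and uses decimal gigabytes; the 4.5-bit figure assumes one 8-bit scale shared by every 16 values, which is an illustrative assumption and not a number from the article.

```python
PARAMS = 175e9  # parameter count from the article's 175B example
GB = 1e9        # decimal gigabytes


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Storage needed for the weights alone at a given bit width."""
    return params * bits_per_param / 8 / GB


fp16_gb = weight_gb(16)
fp8_gb = weight_gb(8)
fp4_gb = weight_gb(4)
# Assumption: one 8-bit scale per 16 values adds 0.5 bits per parameter.
fp4_with_scales_gb = weight_gb(4 + 8 / 16)

for name, size in [("FP16", fp16_gb), ("FP8", fp8_gb),
                   ("FP4", fp4_gb), ("FP4 + scales", fp4_with_scales_gb)]:
    print(f"{name:>12}: {size:7.1f} GB ({fp16_gb / size:.2f}x smaller than FP16)")
```

The weights come to 350 GB at FP16, 175 GB at FP8, and 87.5 GB at FP4. Optimizer states, gradients, and activations are not included and are often kept at higher precision, so the end-to-end saving in a real training run is likely smaller than the raw 4x.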

The JAX Ecosystem Angle

MaxText, Google's open-source LLM training library, now supports NVFP4 natively. This is notable because MaxText is the primary training framework for Gemini models at Google DeepMind, and because Google is both a major NVIDIA customer and a competitor in AI hardware via TPUs. By baking NVFP4 into MaxText, NVIDIA ensures that Google's internal training stack, along with any external user of MaxText, can use Blackwell's lower precision without custom kernel development. The integration covers both forward and backward passes, according to the blog.

Accuracy Claims Under Scrutiny

NVIDIA's claim of "no accuracy loss versus FP8" warrants skepticism. The company tested on models up to 70B parameters but did not release perplexity scores, downstream task evaluations, or convergence curves. For comparison, FP8 training often requires loss scaling and gradient clipping to maintain stability; FP4 compounds quantization noise. Without independent reproduction — especially for models in the 100B+ range — the safe assumption is that FP4 will introduce some degradation that may be acceptable for certain workloads (e.g., fine-tuning) but not others (e.g., pre-training from scratch).
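To give a feel for why a 4-bit format carries more quantization noise than FP8, the sketch below rounds the same random data onto an E2M1 (FP4) grid and an E4M3 (FP8) grid and compares the relative RMS error. The Gaussian test data, the 16-value block scaling applied to both formats, and the helper names are assumptions for illustration. It measures one-shot rounding error only and says nothing about training convergence, which is the question the article raises.

```python
import bisect
import math
import random


def minifloat_grid(exp_bits, man_bits, bias, drop_top=False):
    """Sorted non-negative values of a small float format (no inf codes)."""
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                values.add(frac * 2.0 ** (1 - bias))        # subnormals
            else:
                values.add((1 + frac) * 2.0 ** (e - bias))  # normals
    grid = sorted(values)
    # E4M3 reserves its top code for NaN, so the largest value is dropped.
    return grid[:-1] if drop_top else grid


def round_to_grid(x, grid):
    """Round x to the nearest grid magnitude, keeping its sign."""
    mag = abs(x)
    i = bisect.bisect_left(grid, mag)
    candidates = grid[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def relative_rms_error(data, grid, block=16):
    """Quantize with one scale per block and return RMS error / RMS signal."""
    err = ref = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / grid[-1] if amax > 0 else 1.0
        for v in chunk:
            q = round_to_grid(v / scale, grid) * scale
            err += (q - v) ** 2
            ref += v ** 2
    return math.sqrt(err / ref)


random.seed(0)  # fixed seed so repeated runs see the same data
data = [random.gauss(0.0, 1.0) for _ in range(4096)]
fp4_grid = minifloat_grid(2, 1, bias=1)                  # E2M1, max 6
fp8_grid = minifloat_grid(4, 3, bias=7, drop_top=True)   # E4M3, max 448
fp4_err = relative_rms_error(data, fp4_grid)
fp8_err = relative_rms_error(data, fp8_grid)
print(f"FP4 relative RMS error: {fp4_err:.4f}")
print(f"FP8 relative RMS error: {fp8_err:.4f}")
```

With only 8 magnitudes to choose from, the FP4 grid produces a visibly larger rounding error than FP8 on the same data. Whether that extra noise matters over a full training run is exactly what the missing perplexity scores and convergence curves would show.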

What to watch

Independent reproduction of FP4 accuracy at 175B scale, ideally by Google DeepMind using MaxText on Blackwell clusters. Also watch for FP4 support in PyTorch and whether AMD's MI400 series counters with its own 4-bit format.





AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

NVFP4 represents a predictable but important step in the precision reduction trajectory. The 1.8x speedup over FP8 is significant but not revolutionary — FP8 itself delivered roughly 2x over FP16 on Hopper. The real question is whether FP4 training can maintain convergence for production-scale models. NVIDIA's claim of 'no accuracy loss' for 70B models is encouraging but leaves the 175B+ regime unaddressed. The integration with MaxText is strategically astute: it gives Google's internal teams a ready path to leverage Blackwell's FP4 hardware without building custom JAX kernels, while also making the capability available to the broader open-source community. This could accelerate adoption of FP4 training in the Gemini lineage, which would be a strong validation signal. However, the lack of published ablation studies — perplexity curves, downstream task scores, training stability metrics — means the accuracy claim remains a vendor assertion until independently verified. The comparison to AMD is also worth watching: if AMD's MI400 series introduces its own 4-bit format without the same ecosystem integration, NVIDIA's advantage in training throughput may widen.
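A throughput speedup is easy to misread as an equal cut in training time, so a small worked conversion may help. The sketch below turns the reported 1.8x figure into a wall-clock reduction; the 30-day baseline is a hypothetical number chosen for illustration, and the calculation assumes the speedup applies uniformly to the whole run.

```python
NVFP4_OVER_FP8 = 1.8  # speedup reported in the article


def time_fraction(speedup: float) -> float:
    """Fraction of the original wall-clock time remaining after a speedup."""
    return 1.0 / speedup


def days_after_speedup(baseline_days: float, speedup: float) -> float:
    """Run length after applying the speedup to a baseline duration."""
    return baseline_days / speedup


reduction = 1.0 - time_fraction(NVFP4_OVER_FP8)
hypothetical_fp8_days = 30.0  # illustrative baseline, not from the article
fp4_days = days_after_speedup(hypothetical_fp8_days, NVFP4_OVER_FP8)
print(f"Wall-clock reduction: {reduction:.1%}")
print(f"{hypothetical_fp8_days:.0f} days at FP8 -> {fp4_days:.1f} days at NVFP4")
```

A 1.8x speedup corresponds to a run that takes about 56% of the original time, a reduction of roughly 44%. Any part of the job that does not benefit from FP4, such as data loading or communication, would shrink that figure further.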