NVFP4 is a 4-bit floating-point format introduced by NVIDIA for Blackwell GPUs that packs two values into one 8-bit register, doubling arithmetic density versus FP8.

Does NVFP4 work with any model size?

NVIDIA claims no accuracy loss versus FP8 for models up to 70B parameters; results for larger models were not disclosed.


NVIDIA NVFP4 on Blackwell Cuts JAX Training by 1.8x in MaxText

NVIDIA NVFP4 on Blackwell achieves 1.8x training speedup over FP8 in JAX/MaxText with no claimed accuracy loss for models up to 70B, but larger-scale validation is needed.

Ala SMITH & AI Research Desk · Jun 8, 2026 · 3 min read · AI-Generated

Source: news.google.com

How much faster does NVFP4 train models on NVIDIA Blackwell compared to FP8?

NVIDIA introduced NVFP4 on Blackwell GPUs, achieving up to 1.8x training speedup over FP8 in JAX/MaxText with no accuracy loss, per the company's June 2026 technical blog.

TL;DR

NVFP4 delivers up to 1.8x training speedup over FP8 on Blackwell. · MaxText with JAX now supports FP4 natively. · NVIDIA claims no accuracy loss vs FP8 for models up to 70B parameters.

NVIDIA's NVFP4 4-bit format on Blackwell GPUs delivers up to 1.8x training speedup over FP8 in JAX/MaxText. The company claims no accuracy loss versus FP8 for models up to 70B parameters.

Key facts

NVFP4 delivers 1.8x training speedup over FP8 on Blackwell.
Format packs two 4-bit values into a single 8-bit register.
No accuracy loss claimed for models up to 70B parameters.
MaxText now includes native FP4 support.
Blackwell has dedicated FP4 tensor cores, absent in H100.

NVIDIA announced NVFP4, a 4-bit floating-point precision format for Blackwell GPUs, integrated into Google's MaxText LLM training library built on JAX. According to the NVIDIA Technical Blog, the format packs two 4-bit values into a single 8-bit register, effectively doubling arithmetic density versus FP8 while maintaining dynamic range through a shared exponent scheme. NVIDIA benchmarked NVFP4 on a GPT-3 175B training run and reported the 1.8x throughput improvement there, while its claim of no accuracy degradation covers only models up to 70B parameters. The company did not disclose accuracy results for larger models or provide full ablation tables.
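The packing described above can be sketched in a few lines of Python. This is an illustrative sketch, not NVIDIA's implementation: it assumes the 4-bit element uses the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), which the article does not spell out, and the nibble order and the helper names `pack_pair`, `unpack_pair`, and `decode_e2m1` are hypothetical.

```python
# Assumption: 4-bit elements use the E2M1 layout, whose eight
# non-negative magnitudes are listed here in code order.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError("code must fit in 4 bits")
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def pack_pair(lo: int, hi: int) -> int:
    """Pack two 4-bit codes into one byte (low nibble first; order is an assumption)."""
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_pair(byte: int) -> tuple[int, int]:
    """Split one byte back into its (low, high) 4-bit codes."""
    return byte & 0xF, (byte >> 4) & 0xF


packed = pack_pair(0x3, 0xE)
lo, hi = unpack_pair(packed)
print(hex(packed), decode_e2m1(lo), decode_e2m1(hi))
```

Because each byte carries two elements, a buffer of N bytes holds 2N values, which is where the doubling in density over FP8 comes from.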

Why FP4 Matters Now

The timing aligns with NVIDIA's broader push into lower-precision training as model sizes cross trillion-parameter thresholds. Blackwell's architecture includes dedicated FP4 tensor cores, a hardware feature absent from Hopper (H100) GPUs. This gives Blackwell a concrete advantage for pre-training and fine-tuning workloads where memory bandwidth is the bottleneck: 4-bit values cut the per-parameter memory footprint by 4x versus FP16 and 2x versus FP8, before counting the overhead of scale factors. For a 175B model, the weights alone shrink from roughly 350 GB at FP16 to about 87.5 GB at FP4, a saving of roughly 262 GB, potentially enabling larger batch sizes or reduced pipeline parallelism.
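A back-of-envelope check of the weight-memory arithmetic for a 175B-parameter model is sketched below. It counts only the raw weight bits and ignores scale factors, optimizer state, and activations; the helper name `weight_bytes` is hypothetical.

```python
def weight_bytes(num_params: int, bits_per_param: float) -> float:
    """Raw storage for the weights alone, in bytes (no scale-factor overhead)."""
    return num_params * bits_per_param / 8


PARAMS = 175_000_000_000  # 175B parameters

footprint_gb = {
    name: weight_bytes(PARAMS, bits) / 1e9
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]
}

for name, gb in footprint_gb.items():
    print(f"{name}: {gb:.1f} GB")

saved_vs_fp16 = footprint_gb["FP16"] - footprint_gb["FP4"]
print(f"FP4 saves {saved_vs_fp16:.1f} GB versus FP16")
```

Under these assumptions the ratio is 4x against FP16 and 2x against FP8; any per-block scale factors stored alongside the weights would reduce the saving slightly.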

The JAX Ecosystem Angle

MaxText, Google's open-source LLM training library, now supports NVFP4 natively. This is notable because MaxText is the primary training framework for Gemini models at Google DeepMind. Google is both a major NVIDIA customer and a competitor in AI hardware via TPUs. By baking NVFP4 into MaxText, NVIDIA ensures that Google's internal training stack, and any external user of MaxText, can immediately leverage Blackwell's lower precision without custom kernel development. The integration covers both forward and backward passes, according to the blog.

Accuracy Claims Under Scrutiny

NVIDIA's claim of "no accuracy loss versus FP8" warrants skepticism. The company tested on models up to 70B parameters but did not release perplexity scores, downstream task evaluations, or convergence curves. For comparison, FP8 training often requires loss scaling and gradient clipping to maintain stability; FP4 compounds quantization noise. Without independent reproduction — especially for models in the 100B+ range — the safe assumption is that FP4 will introduce some degradation that may be acceptable for certain workloads (e.g., fine-tuning) but not others (e.g., pre-training from scratch).
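To get a feel for how much noise a 4-bit grid introduces, the sketch below round-trips Gaussian values through a simulated block-scaled FP4 quantizer and reports the relative RMS error. It is a toy model under stated assumptions, not NVIDIA's recipe: the E2M1 value grid, the 16-element block size, and a full-precision per-block scale are all assumptions, and the function names are hypothetical.

```python
import math
import random

# Assumption: E2M1 magnitudes, with the block maximum mapped to 6.0.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Round each value to the nearest scaled E2M1 magnitude, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # kept in full precision here (assumption)
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def relative_rms_error(values, block_size=16):
    """Relative RMS error of a quantize/dequantize round trip over all blocks."""
    dequantized = []
    for start in range(0, len(values), block_size):
        dequantized.extend(quantize_block(values[start:start + block_size]))
    noise = sum((a - b) ** 2 for a, b in zip(values, dequantized))
    signal = sum(a * a for a in values)
    return math.sqrt(noise / signal)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
err = relative_rms_error(weights)
print(f"relative RMS round-trip error: {err:.3f}")
```

A single-tensor round trip like this says nothing about end-to-end convergence, which is why published perplexity and downstream evaluations matter for judging the accuracy claim.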

What to watch

Watch for independent reproduction of FP4 accuracy at 175B scale, ideally by Google DeepMind using MaxText on Blackwell clusters. Also watch for FP4 support in PyTorch and whether AMD's MI400 series counters with its own 4-bit format.


Source: gentic.news · Jun 8, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

NVFP4 represents a predictable but important step in the precision reduction trajectory. The 1.8x speedup over FP8 is significant but not revolutionary — FP8 itself delivered roughly 2x over FP16 on Hopper. The real question is whether FP4 training can maintain convergence for production-scale models. NVIDIA's claim of 'no accuracy loss' for 70B models is encouraging but leaves the 175B+ regime unaddressed. The integration with MaxText is strategically astute: it gives Google's internal teams a ready path to leverage Blackwell's FP4 hardware without building custom JAX kernels, while also making the capability available to the broader open-source community. This could accelerate adoption of FP4 training in the Gemini lineage, which would be a strong validation signal. However, the lack of published ablation studies — perplexity curves, downstream task scores, training stability metrics — means the accuracy claim remains a vendor assertion until independently verified. The comparison to AMD is also worth watching: if AMD's MI400 series introduces its own 4-bit format without the same ecosystem integration, NVIDIA's advantage in training throughput may widen.
