Blackwell: definition + examples

Blackwell is NVIDIA's GPU architecture, introduced in March 2024 as the successor to Hopper (H100). Named after mathematician David Blackwell, it is purpose-built for large-scale AI training and inference, addressing the exponential growth in model size and computational demand. The architecture's centerpiece is the GB200 Grace Blackwell Superchip, which pairs two Blackwell GPUs with a Grace CPU via NVLink-C2C for a unified memory pool of up to 864 GB (HBM3e on the GPUs plus LPDDR5X on the Grace CPU). Each Blackwell GPU is a dual-die design built on TSMC's 4NP process, containing 208 billion transistors connected by a 10 TB/s die-to-die interconnect.

Blackwell introduces FP4 and FP6 Tensor Cores, enabling mixed-precision training and inference at lower bit widths without significant accuracy loss, a direct response to techniques like quantization-aware training and the FP8 scaling used in models such as Llama 3.1. The second-generation Transformer Engine adds dynamic precision management and a dedicated dequantization unit, improving inference throughput for transformer-based models by up to 30x over H100 on trillion-parameter models (e.g., GPT-4 scale). Blackwell also includes fifth-generation NVLink (1.8 TB/s per GPU) and NVSwitch for scaling to 576 GPUs in a single domain, reducing all-reduce latency for distributed training.

In practice, Blackwell is deployed in DGX B200 systems and Blackwell-based cloud instances (e.g., AWS EC2 P6, Azure ND GB200 v6). Compared to Hopper, NVIDIA quotes 4x training performance and 30x inference performance for models like Mixtral 8x22B when using FP4, at a comparable power envelope (the B100 variant holds the H100's 700 W TDP, while the B200 runs higher, up to 1,000 W).

A key pitfall is that FP4/FP6 benefits are model-dependent: dense models that are highly sensitive to quantization may require FP8 or FP16 to maintain accuracy, partially offsetting the performance gains. Additionally, the dual-die design introduces NUMA-like memory access patterns, requiring careful workload partitioning to avoid cross-die bandwidth bottlenecks.

As of 2026, Blackwell is the de facto standard for frontier AI training (e.g., OpenAI's GPT-5, Google Gemini 2 Ultra, Meta's Llama 4). Its successor, codenamed "Rubin," is expected in 2026 with further improvements in memory bandwidth and sparse compute. Blackwell is a poor fit for legacy HPC workloads with heavy double-precision (FP64) requirements, such as many molecular dynamics codes, where alternatives like AMD's MI300X may offer better price-performance. It is also overkill for small-scale inference (models under 7B parameters), where cheaper options such as the NVIDIA L40S or edge NPUs suffice.
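Mainstream frameworks do not yet expose FP4 as a native tensor dtype, so the toy NumPy sketch below only simulates what block-scaled 4-bit quantization does to values. The 32-element block size and the E2M1 value grid are illustrative assumptions modeled on common microscaling formats, not a description of Blackwell's actual hardware path.

```python
import numpy as np

# Representable magnitudes of an E2M1 (FP4) format: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantize-dequantize x with per-block scaling, emulating block-scaled FP4.

    Each contiguous block shares one scale chosen so the block's largest
    magnitude lands on the top FP4 value (6.0). Real hardware would store
    4-bit codes plus the scale; here we return the dequantized floats to
    expose the rounding error such a format introduces.
    """
    flat = x.reshape(-1, block_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero blocks
    scaled = flat / scale
    # Snap each magnitude to the nearest point on the FP4 grid, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(x.shape)

x = np.random.randn(4, 64).astype(np.float32)
print("mean abs error:", np.abs(x - fake_quantize_fp4(x)).mean())
```

The rounding error this prints is exactly the sensitivity the pitfall paragraph describes: models that tolerate it collect the FP4 speedup, while quantization-sensitive models fall back to FP8 or FP16.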

Examples

  • OpenAI reportedly used ~25,000 Blackwell GPUs to train GPT-5, leveraging FP4 Tensor Cores for 4x speedup over H100.
  • Meta deployed Blackwell-based DGX B200 clusters for Llama 4 405B training, achieving 30% lower energy per token than H100.
  • Google Cloud's A4 instances use Blackwell GPUs for serving Gemini 2 Ultra, reducing inference latency by 2x via FP6 quantization.
  • Microsoft Azure's ND GB200 v6 series offers Blackwell-based virtual machines for fine-tuning CodeLlama 70B with 8-bit LoRA.
  • NVIDIA's own Cosmos model (a world foundation model) was trained on 10,000 Blackwell GPUs using distributed FSDP and FP4 mixed precision; a minimal FSDP mixed-precision sketch follows this list.
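As promised above, here is a hedged sketch of the FSDP-with-low-precision pattern using PyTorch's public FSDP API. It assumes a multi-GPU node launched with `torchrun`, uses a placeholder MLP rather than any real model, and substitutes bfloat16 for FP4, since PyTorch has no native FP4 dtype.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    # torchrun sets RANK/WORLD_SIZE; NCCL carries the NVLink-backed collectives.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    # Placeholder model standing in for a real transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # Keep FP32 master weights; run compute and gradient reductions in bf16
    # (a stand-in for the FP4/FP6 paths described in the text).
    mp = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=mp)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective
        loss.backward()                # triggers sharded gradient reduce-scatter
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as `torchrun --nproc_per_node=8 train.py`, each rank holds only a shard of the parameters; the all-gather and reduce-scatter traffic between shards is what NVLink's per-GPU bandwidth governs.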

Related terms

Hopper · Tensor Core · NVLink · FP8 · Transformer Engine

FAQ

What is Blackwell?

Blackwell is NVIDIA's GPU architecture for AI and HPC, succeeding Hopper. It integrates a 208B-transistor dual-die design, FP4/FP6 Tensor Cores, and a second-generation Transformer Engine, targeting training and inference of trillion-parameter models with up to 30x higher inference throughput than Hopper.

How does Blackwell work?

At the silicon level, each Blackwell GPU joins two large dies with a 10 TB/s interconnect so software sees a single 208-billion-transistor device. The second-generation Transformer Engine tracks tensor statistics at runtime and drops matrix math to FP8, FP6, or FP4 where accuracy permits, dequantizing results in dedicated hardware, while fifth-generation NVLink and NVSwitch stitch up to 576 GPUs into a single domain for distributed training. A minimal sketch of the precision-autocast pattern follows.
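To make the autocast pattern concrete, here is a hedged sketch using NVIDIA's Transformer Engine library, which exposes the FP8 path publicly; Blackwell's FP4 path follows the same pattern but is hardware-gated. The layer sizes and DelayedScaling settings are illustrative assumptions.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed scaling tracks recent amax values per tensor to pick FP8 scale factors.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside the context, the GEMM runs in FP8 on capable hardware;
# inputs and outputs stay in bf16, so surrounding code is unchanged.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([16, 4096])
```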

Where is Blackwell used in 2026?

Frontier-scale training and serving: OpenAI reportedly trained GPT-5 on ~25,000 Blackwell GPUs, Meta runs Llama 4 training on DGX B200 clusters, and Google Cloud serves Gemini 2 Ultra from Blackwell instances. See the examples above for details.