Blackwell is NVIDIA's GPU architecture, announced in March 2024 as the successor to Hopper (H100). Named after mathematician David Blackwell, it is purpose-built for large-scale AI training and inference, addressing the rapid growth in model size and computational demand. The architecture's flagship configuration is the GB200 Grace Blackwell Superchip, which pairs two Blackwell GPUs with a Grace CPU via NVLink-C2C for a coherent memory pool of up to 864 GB (HBM3e on the GPUs plus LPDDR5X on the Grace CPU).

Each Blackwell GPU is a dual-die design fabricated on TSMC's 4NP process, containing 208 billion transistors connected by a 10 TB/s die-to-die interconnect. It introduces FP4 and FP6 Tensor Cores, enabling mixed-precision training and inference at lower bit widths with limited accuracy loss, a direct response to techniques such as quantization-aware training and the FP8 scaling used in models like Llama 3.1. The second-generation Transformer Engine adds dynamic precision management with finer-grained (micro-tensor) scaling; NVIDIA claims up to 30x higher inference throughput than H100 on trillion-parameter models (e.g., GPT-4 scale). Blackwell also includes fifth-generation NVLink (1.8 TB/s per GPU, double Hopper's 900 GB/s) and NVSwitch fabrics that scale to 576 GPUs in a single NVLink domain, reducing all-reduce latency for distributed training.

In practice, Blackwell is deployed in DGX B200 systems and cloud instances (e.g., AWS EC2 P6, Azure ND GB200 v6). Compared to Hopper, NVIDIA cites up to 4x training and up to 30x inference performance for models like Mixtral 8x22B when using FP4, though per-GPU TDP rises from the H100's 700W to roughly 1,000W for the B200 (higher in GB200 configurations). A key pitfall is that FP4/FP6 benefits are model-dependent; dense models that are sensitive to quantization may require FP8 or FP16 to maintain accuracy, partially offsetting the performance gains.
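To make the FP4 discussion concrete, here is a toy round-to-nearest sketch of FP4 (E2M1) quantization in pure Python. It is not NVIDIA's implementation (real hardware uses per-block micro-tensor scales, not one per-tensor scale), but it shows why the format is so coarse: only eight magnitudes are representable.

```python
# The 8 non-negative values representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values):
    """Scale values into E2M1 range, round each magnitude to the nearest
    grid point, and return the dequantized approximation plus the scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto FP4's max value, 6.0
    out = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to nearest
        out.append(q * scale if v >= 0 else -q * scale)
    return out, scale

deq, s = quantize_fp4([0.02, -0.7, 1.3, 2.9])
# Small values collapse to 0.0 and mid-range values snap to a coarse grid,
# which is why quantization-sensitive models may need FP8 instead.
```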
Additionally, the dual-die design introduces NUMA-like memory access patterns, requiring careful workload partitioning to avoid cross-die bandwidth bottlenecks. As of 2026, Blackwell is the de facto standard for frontier AI training (e.g., OpenAI's GPT-5, Google Gemini 2 Ultra, Meta's Llama 4). Its successor, codenamed "Rubin," is expected in 2026 with further improvements in memory bandwidth and sparse compute. Blackwell is a poor fit for traditional HPC workloads with double-precision (FP64) requirements (e.g., many molecular dynamics and climate codes), where GPUs such as AMD's MI300X may offer better price-performance. It is also overkill for small-scale inference (models under ~7B parameters), where cheaper options like the NVIDIA L40S or edge NPUs suffice.
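The cross-die locality concern can be sketched with a toy traffic model: count the bytes that must traverse the die-to-die link when work is scheduled without regard to where its data lives, versus co-located with it. The placement policies and byte counts below are illustrative, not measurements of real hardware.

```python
def cross_die_bytes(work_die, data_die, bytes_per_item):
    """Bytes that must traverse the die-to-die interconnect: one entry per
    work item, counted whenever the item runs on a different die than its data."""
    return sum(b for w, d, b in zip(work_die, data_die, bytes_per_item) if w != d)

n = 8
bytes_per_item = [1024] * n
data_die = [0, 0, 0, 0, 1, 1, 1, 1]     # data shards pinned half to each die

scattered = [i % 2 for i in range(n)]   # round-robin scheduling ignores locality
aligned = list(data_die)                # schedule each item on the die holding its shard

print(cross_die_bytes(scattered, data_die, bytes_per_item))  # 4096: half the reads cross
print(cross_die_bytes(aligned, data_die, bytes_per_item))    # 0: fully local
```

The same partitioning logic applies at larger scale (GPU-to-GPU over NVLink); the dual-die case just moves the boundary inside a single GPU package.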
Examples
- OpenAI reportedly used ~25,000 Blackwell GPUs to train GPT-5, leveraging FP4 Tensor Cores for 4x speedup over H100.
- Meta deployed Blackwell-based DGX B200 clusters for Llama 4 405B training, achieving 30% lower energy per token than H100.
- Google Cloud's A4 instances use Blackwell GPUs for serving Gemini 2 Ultra, reducing inference latency by 2x via FP6 quantization.
- Microsoft Azure's ND GB200 v6 series offers Blackwell-based virtual machines for fine-tuning Code Llama 70B with 8-bit LoRA.
- NVIDIA's own Cosmos model (world foundation model) was trained on 10,000 Blackwell GPUs using distributed FSDP and FP4 mixed precision.
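Several of these examples hinge on interconnect bandwidth for distributed training. A back-of-envelope model shows why: a ring all-reduce moves 2(N-1)/N of the gradient size through each GPU's links per step. The sketch below uses the fifth-generation NVLink figure of 1.8 TB/s and assumes one byte per gradient element for illustration; real runs see lower effective bandwidth and overlap communication with compute.

```python
def ring_allreduce_seconds(grad_bytes, n_gpus, link_bytes_per_s):
    """Ideal time for one ring all-reduce, ignoring latency and overlap.
    Each GPU sends and receives 2*(N-1)/N of the total gradient size."""
    per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return per_gpu_traffic / link_bytes_per_s

# Illustration: 405e9 gradient bytes synchronized across a 72-GPU
# NVLink domain at 1.8 TB/s per GPU.
t = ring_allreduce_seconds(405e9, 72, 1.8e12)
# Roughly 0.44 s of pure communication per step in this idealized model,
# which is why halving link bandwidth directly hurts step time.
```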
Latest news mentioning Blackwell
- CPU Demand Flipping the AI Narrative as Datacenter Growth Shifts (Apr 28, 2026)
  A new analysis from SemiAnalysis indicates CPU demand is rising in AI datacenters, reversing the narrative of GPU-only dominance. This shift signals changing workload patterns and infrastructure priorities.
- Vertiv Acquires Strategic Thermal Labs for Liquid Cooling (Apr 28, 2026)
  Vertiv acquired Strategic Thermal Labs to add cold-plate design expertise to its liquid cooling portfolio, addressing the rising thermal demands of AI workloads in data centers.
- OpenAI Breaks Microsoft Exclusivity, Eyes AWS and GCP (Apr 28, 2026)
  OpenAI is moving away from its exclusive Microsoft cloud arrangement, signaling potential partnerships with Amazon AWS and Google Cloud to diversify infrastructure and reduce dependency.
- Pyptx: Write Nvidia PTX Kernels in Python for Hopper and Blackwell (Apr 26, 2026)
  Pyptx lets developers write and launch hand-tuned Nvidia PTX kernels directly from Python, supporting Hopper (sm_90a) and Blackwell (sm_100a). It provides explicit control over registers and shared memory.
- AWS Never Retired an A100 Server, CEO Says Amid Chip Shortage (Apr 26, 2026)
  AWS CEO Matt Garman stated that A100 servers are completely sold out and never retired, as demand for older chips outpaces supply. This underscores the prolonged GPU shortage and the value of legacy hardware.
FAQ
What is Blackwell?
Blackwell is NVIDIA's GPU architecture for AI and HPC, succeeding Hopper. It integrates a 208B-transistor dual-die design, FP4/FP6 Tensor Cores, and a second-generation Transformer Engine, targeting training and inference of trillion-parameter models with up to 25x lower cost and energy use than the prior generation, per NVIDIA's claims.
How does Blackwell work?
Each Blackwell GPU combines two reticle-limited dies (208 billion transistors in total, on TSMC's 4NP process) behind a 10 TB/s die-to-die interconnect, so software sees a single GPU. FP4 and FP6 Tensor Cores, managed by the second-generation Transformer Engine's dynamic precision scaling, lower bit widths where accuracy allows, while fifth-generation NVLink and NVSwitch connect up to 576 GPUs into one domain for distributed training. In the GB200 Superchip, two Blackwell GPUs share a coherent memory pool with a Grace CPU over NVLink-C2C.
Where is Blackwell used in 2026?
OpenAI reportedly used ~25,000 Blackwell GPUs to train GPT-5, leveraging FP4 Tensor Cores for 4x speedup over H100. Meta deployed Blackwell-based DGX B200 clusters for Llama 4 405B training, achieving 30% lower energy per token than H100. Google Cloud's A4 instances use Blackwell GPUs for serving Gemini 2 Ultra, reducing inference latency by 2x via FP6 quantization.