Blackwell is NVIDIA's GPU architecture, announced in March 2024 as the successor to Hopper (H100). Named after mathematician David Blackwell, it is built for large-scale AI training and inference, where model size and compute demand have grown rapidly.
The architecture's flagship product is the GB200 Grace Blackwell Superchip, which pairs two Blackwell GPUs with one Grace CPU over NVLink-C2C. The superchip exposes up to 864 GB of fast memory: up to 384 GB of HBM3e on the GPUs plus up to 480 GB of LPDDR5X on the CPU. Each Blackwell GPU is a dual-die design built on TSMC's 4NP process, with 208 billion transistors and a 10 TB/s die-to-die interconnect.
Blackwell adds FP4 and FP6 Tensor Core formats, enabling mixed-precision training and inference at lower bit widths when the model tolerates the reduced precision. This builds on techniques such as quantization-aware training and the FP8 scaling used with models such as Llama 3.1. The second-generation Transformer Engine incorporates dynamic precision management and a dedicated dequantization unit. NVIDIA claims up to 30x higher inference throughput than H100 on trillion-parameter models; that figure comes from a rack-scale GB200 NVL72 system running a 1.8-trillion-parameter mixture-of-experts model, not from a single GPU.
Blackwell also includes fifth-generation NVLink (1.8 TB/s per GPU, double Hopper's 900 GB/s) and NVSwitch for scaling to 576 GPUs in a single domain, which reduces all-reduce latency in distributed training. In practice, Blackwell is deployed in DGX B200 systems and cloud instances (e.g., AWS EC2 P6, Azure ND GB200 v6).
Compared to Hopper, NVIDIA quotes up to 4x training performance and up to 30x inference performance at system level for large mixture-of-experts models when using FP4. Power draw is higher than Hopper's: a B200 GPU is rated at roughly 1,000 W, versus 700 W for H100. A key pitfall is that FP4/FP6 benefits are model-dependent; dense models that are highly sensitive to quantization may require FP8 or FP16 to maintain accuracy, which partially offsets the performance gains.
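One rough way to see why FP4 gains are model-dependent is to round a tensor to the FP4 value grid in software and measure the error. The sketch below is an illustration only, not NVIDIA's implementation: it assumes the common E2M1 layout for FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a shared scale per block of values, and it keeps that scale in full precision, which real hardware formats do not. The function names and the block sizes are hypothetical choices for the example.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit). Assumed layout for this sketch.
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0


def quantize_block_fp4(block):
    """Round one block of floats to the FP4 grid using a shared per-block scale.

    Returns the dequantized values so the rounding error can be measured.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = amax / FP4_MAX
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        out.append((nearest if x >= 0 else -nearest) * scale)
    return out


def quantize_fp4(values, block_size=32):
    """Quantize a flat list of floats block by block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block_fp4(values[start:start + block_size]))
    return out


def relative_rms_error(original, quantized):
    """Root-mean-square error relative to the RMS of the original values."""
    num = sum((a - b) ** 2 for a, b in zip(original, quantized))
    den = sum(a * a for a in original)
    return (num / den) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]
    for block_size in (16, 32, 4096):
        err = relative_rms_error(weights, quantize_fp4(weights, block_size))
        print(f"block_size={block_size:5d}  relative RMS error={err:.3f}")
```

Running the same measurement on a model's real weight or activation tensors gives a quick first signal of whether a layer is likely to tolerate FP4 or should stay in FP8 or FP16; it does not replace an end-to-end accuracy evaluation.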
The dual-die design can also introduce NUMA-like memory access patterns, so workloads may need careful partitioning to avoid cross-die bandwidth bottlenecks. As of 2026, Blackwell is widely deployed for frontier AI training, reportedly including work at OpenAI and Meta. Google is a notable exception for training, since its Gemini models are trained mainly on in-house TPUs. Blackwell's successor, codenamed "Rubin," is expected in 2026 with further improvements in memory bandwidth and sparse compute. Blackwell is a weaker fit for traditional HPC workloads that depend on double-precision (FP64) throughput, where GPUs such as the AMD MI300X may offer better price-performance. It is also overkill for small-scale inference (models under 7B parameters), where cheaper options such as the NVIDIA L40S or edge NPUs suffice.
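The sizing argument above (trillion-parameter models at one end, sub-7B models at the other) can be checked with simple arithmetic on weight storage. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts model weights only (no KV cache, activations, or optimizer state), uses decimal gigabytes, and takes 192 GB of HBM per GPU as an assumed capacity. The function names are hypothetical.

```python
import math

# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP6": 6, "FP4": 4}

# Assumed HBM capacity per GPU in decimal GB; adjust for the actual part.
ASSUMED_HBM_GB_PER_GPU = 192


def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, bits_per_param,
                         hbm_gb_per_gpu=ASSUMED_HBM_GB_PER_GPU):
    """Smallest GPU count whose combined HBM holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / hbm_gb_per_gpu)


if __name__ == "__main__":
    for label, params in (("7B", 7_000_000_000), ("1T", 1_000_000_000_000)):
        for fmt, bits in BITS_PER_PARAM.items():
            gb = weight_memory_gb(params, bits)
            gpus = min_gpus_for_weights(params, bits)
            print(f"{label} model, {fmt}: {gb:8.1f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions a 7B model in FP16 needs about 14 GB for weights, a small fraction of one GPU's memory, while a trillion-parameter model needs several GPUs even in FP4. Real deployments need additional headroom for the KV cache and activations.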
Blackwell: definition + examples
Examples
- OpenAI reportedly used ~25,000 Blackwell GPUs to train GPT-5, leveraging FP4 Tensor Cores for 4x speedup over H100.
- Meta reportedly deployed Blackwell-based DGX B200 clusters for Llama 4 training, with a claimed 30% lower energy per token than H100.
- Google Cloud's A4 instances use Blackwell (B200) GPUs for serving large models, with a reported 2x reduction in inference latency via FP6 quantization.
- Microsoft Azure's ND GB200 v6 series offers Blackwell-based virtual machines, suitable for workloads such as fine-tuning CodeLlama 70B with 8-bit LoRA.
- NVIDIA's own Cosmos model (a world foundation model) was reportedly trained on 10,000 Blackwell GPUs using distributed FSDP and FP4 mixed precision.
Latest news mentioning Blackwell
- Hermes Agent Hits 140K GitHub Stars, Nvidia RTX as Local Inference Bedrock (May 13, 2026)
  Hermes Agent hit 140K GitHub stars and is the most-used agent on OpenRouter. It runs locally on Nvidia RTX with self-evolving skills and Qwen 3.6 models that beat prior 120B-parameter models.
- B200 PD Disaggregation Boosts Token Throughput 7x, Slashes Cost (May 12, 2026)
  B200 clusters with PD disaggregation over RoCEv2 Ethernet achieve 7x token throughput, cutting cost per million tokens 7x.
- Perplexity Claims 3x Blackwell Inference Throughput for 70B Models (May 12, 2026)
  Perplexity AI claims 3x inference throughput for 70B models on Nvidia Blackwell GPUs via FP4 and custom scheduling. The gain exceeds Nvidia's own 2x marketing claim.
- Nvidia Blackwell CLC Boosts GEMM Tile Scheduling by 15% Over Static Persistence (May 11, 2026)
  Nvidia Blackwell CLC delivers up to 15% higher GEMM throughput via dynamic persistent tile scheduling, fixing load imbalance without startup overhead.
- AMD ROCm Performance Jumps 75x in 14 Days Post-DeepSeek v4 (May 10, 2026)
  AMD ROCm stack improved 75x in 14 days post-DeepSeek v4 via fused operations. It still needs a further 5x to match B200 performance.
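The CLC item above contrasts static persistent tile scheduling with dynamic scheduling. The toy model below sketches why dynamic assignment can help when tiles take unequal time. It is a CPU-side simulation under simplifying assumptions (known tile costs, no scheduling overhead, hypothetical function names), not a description of how the hardware feature works.

```python
import heapq


def static_makespan(tile_costs, num_workers):
    """Finish time when tile i is pre-assigned to worker i % num_workers."""
    loads = [0] * num_workers
    for i, cost in enumerate(tile_costs):
        loads[i % num_workers] += cost
    return max(loads)


def dynamic_makespan(tile_costs, num_workers):
    """Finish time when each tile goes to whichever worker becomes free first."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0] * num_workers
    heapq.heapify(free_at)
    for cost in tile_costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    # Alternating expensive and cheap tiles: a worst case for round-robin
    # assignment with two workers, since one worker gets every expensive tile.
    costs = [3, 1, 3, 1, 3, 1]
    print("static :", static_makespan(costs, 2))
    print("dynamic:", dynamic_makespan(costs, 2))
```

In this example the static assignment finishes at time 9 and the dynamic one at time 7, against an ideal of 6. The size of the gap on real GEMM workloads depends on how uneven the tile costs are.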
FAQ
What is Blackwell?
Blackwell is NVIDIA's GPU architecture for AI and HPC, succeeding Hopper. It combines a 208-billion-transistor dual-die design, FP4/FP6 Tensor Cores, and a second-generation Transformer Engine, targeting training and inference of trillion-parameter models. NVIDIA claims up to 25x lower cost and energy consumption than Hopper for large-model inference.
How does Blackwell work?
Each Blackwell GPU joins two dies over a 10 TB/s interconnect, for a total of 208 billion transistors. Its Tensor Cores add FP4 and FP6 formats, and the second-generation Transformer Engine manages precision dynamically so that transformer layers can run at lower bit widths where accuracy allows. Fifth-generation NVLink and NVSwitch connect up to 576 GPUs in a single domain, and the GB200 Grace Blackwell Superchip pairs two Blackwell GPUs with a Grace CPU via NVLink-C2C.
Where is Blackwell used in 2026?
OpenAI reportedly used ~25,000 Blackwell GPUs to train GPT-5, leveraging FP4 Tensor Cores for a 4x speedup over H100. Meta reportedly deployed Blackwell-based DGX B200 clusters for Llama 4 training, with a claimed 30% lower energy per token than H100. Google Cloud's A4 instances use Blackwell (B200) GPUs for serving large models, with a reported 2x reduction in inference latency via FP6 quantization.