The B100 is a data-center GPU from NVIDIA based on the Blackwell architecture, announced in March 2024 and shipping in volume by late 2024. It is the direct successor to the Hopper-based H100 (2022) and represents a generational leap in AI/ML throughput, particularly for transformer-based models.
How it works technically: The B100 is built on a custom TSMC 4NP (4 nm-class) process and contains 208 billion transistors, making it the largest GPU NVIDIA has produced. It integrates two reticle-limited dies connected by the NV-HBI (NVIDIA High-Bandwidth Interface) die-to-die link at 10 TB/s, allowing them to operate as a single logical GPU. The core architectural innovation is the second-generation Transformer Engine, which adds native FP4 (4-bit floating point) and FP6 (6-bit floating point) tensor-core formats alongside FP8, FP16, BF16, TF32, and FP64. This allows models to be trained and served at lower precision with minimal accuracy loss, roughly doubling peak tensor throughput and effective memory capacity relative to FP8 (and quadrupling them relative to FP16). The B100 packs 192 GB of HBM3e memory with 8 TB/s of bandwidth (vs. the H100's 80 GB of HBM3 at 3.35 TB/s), letting larger models fit on a single GPU without model parallelism. It also features fifth-generation NVLink (1.8 TB/s per GPU, double the H100's 900 GB/s), tensor-core support for 2:4 structured sparsity, and improved asynchronous execution that overlaps compute, memory, and communication more efficiently.
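To make the precision arithmetic concrete, here is a minimal back-of-the-envelope sketch in plain Python (no GPU required). The 70B-parameter model size is an arbitrary example rather than an NVIDIA figure, and real deployments also need room for activations, KV cache, and the scaling metadata that block-scaled low-bit formats carry.

```python
# Back-of-the-envelope: weight memory per precision for a given parameter count.
# Illustrative only; ignores activations, KV cache, and scaling metadata.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "TF32": 4.0,   # stored in 32-bit containers, computed with a reduced mantissa
    "FP16": 2.0,
    "BF16": 2.0,
    "FP8":  1.0,
    "FP6":  0.75,
    "FP4":  0.5,
}

def weight_gib(params: float, fmt: str) -> float:
    """Gibibytes of weight storage for `params` parameters stored in format `fmt`."""
    return params * BYTES_PER_PARAM[fmt] / 2**30

if __name__ == "__main__":
    n_params = 70e9  # e.g., a 70B-parameter dense model (arbitrary example)
    for fmt in ("FP16", "FP8", "FP6", "FP4"):
        print(f"{fmt}: {weight_gib(n_params, fmt):7.1f} GiB of weights")
    # On a 192 GB GPU, the FP16 copy (~130 GiB) fits only tightly,
    # while the FP4 copy (~33 GiB) leaves ample room for KV cache.
```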
Why it matters: The B100 addresses the two primary bottlenecks in modern deep learning: memory capacity for large models and compute efficiency for inference. At FP4 (half a byte per parameter), 192 GB is enough weight storage for a dense model on the order of 350 billion parameters on a single GPU; trillion-parameter mixture-of-experts models still need sharding, but across far fewer GPUs than with the 80 GB H100. For inference, the combination of low-precision support, larger memory, and higher bandwidth reduces latency and cost per token dramatically. NVIDIA's launch figures (GTC, March 2024) claim roughly 4× faster training and up to 30× faster inference on large transformer workloads relative to the H100, with the headline inference number depending on FP4 quantization and rack-scale NVLink (GB200 NVL72) configurations.
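The inference claim is easiest to sanity-check with a memory-bandwidth roofline: at small batch sizes, decoding is dominated by streaming the weights once per generated token, so latency per token is roughly weight bytes divided by memory bandwidth. The sketch below uses that simplification (it ignores KV-cache traffic, kernel overheads, and batching, so treat the numbers as illustrative only):

```python
# Rough roofline estimate of single-stream decode latency: generating one token
# requires reading (approximately) all weights once, so latency ~ bytes / bandwidth.
# Simplified: ignores attention/KV-cache traffic, kernel overheads, and batching.

def ms_per_token(params: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    weight_bytes = params * bytes_per_param
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3

params = 70e9  # 70B-parameter dense model, as an example

h100_fp16 = ms_per_token(params, 2.0, 3.35)  # H100: FP16 weights, 3.35 TB/s HBM3
b100_fp8  = ms_per_token(params, 1.0, 8.0)   # B100: FP8 weights, 8 TB/s HBM3e
b100_fp4  = ms_per_token(params, 0.5, 8.0)   # B100: FP4 weights, 8 TB/s HBM3e

print(f"H100 FP16: ~{h100_fp16:.1f} ms/token")
print(f"B100 FP8:  ~{b100_fp8:.1f} ms/token")
print(f"B100 FP4:  ~{b100_fp4:.1f} ms/token")
# The ~9-10x gap between H100/FP16 and B100/FP4 comes from the bandwidth ratio
# (2.4x) multiplied by the smaller weight footprint (4x).
```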
When it's used vs. alternatives: The B100 is the premier choice for large-scale AI training and high-throughput inference in hyperscaler data centers (e.g., AWS, Azure, GCP). Alternatives include the AMD Instinct MI300X (192 GB HBM3, 5.3 TB/s), which offers competitive memory capacity but lacks FP4/FP6 support and has a less mature software stack (ROCm vs. CUDA), and Intel Gaudi 3 (128 GB HBM2e), which is more cost-effective for inference but lags in training performance. Within the Blackwell family, the B100's 700 W envelope makes it a drop-in upgrade for existing air-cooled HGX H100 systems, while the B200 is the higher-TDP, higher-throughput sibling; for workstation or lower-budget scenarios the consumer RTX 5090 (Blackwell, 32 GB GDDR7) is used. The B100's main competitors in 2026 are expected to be the AMD Instinct MI400 series and custom ASICs such as Google's TPU v6.
Common pitfalls: (1) Over-reliance on FP4 for training: while FP4 works well for inference, training at 4-bit requires careful scaling (per-tensor or per-block scale factors) and may degrade final accuracy for some model families (e.g., CNNs or RNNs); a simplified illustration follows this list. (2) The NV-HBI die-to-die link: inter-die communication can become a bottleneck if workloads are not optimized for the NUMA-like memory locality of the two dies. (3) Power density: the B100 retains the H100's 700 W TDP so it can run in existing air-cooled racks, but dense Blackwell deployments (particularly B200/GB200 systems at 1,000 W and above) generally require liquid cooling, and marginal air-cooled racks may throttle performance. (4) Software lock-in: CUDA 12.x and TensorRT-LLM are required to unlock FP4/FP6; using generic PyTorch without these may yield only marginal gains over the H100.
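To illustrate pitfall (1), the snippet below simulates 4-bit quantization in NumPy with and without per-block scale factors. It is a software sketch of the general block-scaling idea, not the hardware FP4 format or NVIDIA's actual scaling recipe:

```python
# Illustration of pitfall (1): 4-bit formats only work with careful scaling.
# Simulates block-scaled 4-bit quantization in software; this is a sketch of the
# idea, not the hardware FP4 format or NVIDIA's scaling recipe.
import numpy as np

def quantize_4bit(x: np.ndarray, block: int = 32, per_block_scale: bool = True) -> np.ndarray:
    """Quantize to a signed 4-bit grid (levels -7..7) and dequantize back."""
    x = x.reshape(-1, block)
    if per_block_scale:
        scale = np.abs(x).max(axis=1, keepdims=True) / 7.0   # one scale per block
    else:
        scale = np.abs(x).max() / 7.0                         # one global scale
    q = np.clip(np.round(x / scale), -7, 7)                   # snap to the 4-bit grid
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
# Weights with a few large outliers, as often seen in real transformer layers.
w = rng.normal(0, 0.02, size=4096)
w[::512] = 1.0

for per_block in (False, True):
    w_hat = quantize_4bit(w, per_block_scale=per_block)
    print(f"per-block scaling={per_block}: mean abs error {np.abs(w - w_hat).mean():.5f}")
# Without per-block scales, the outliers force a coarse global grid and most of
# the small weights collapse to zero; per-block scaling keeps the error small.
```

The same intuition is why 4-bit training needs even more care than inference: gradients span a wider dynamic range than weights, so a single poorly chosen scale loses information quickly.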
Current state of the art (2026): As of early 2026, the B100 and B200 are the dominant GPUs for AI training in hyperscale clouds, with the follow-up Blackwell Ultra (B300) ramping and offering 288 GB of HBM3e and roughly 1.5× the FP4 compute. The successor 'Rubin' architecture (expected 2026-2027) is in design. Software-wise, the open-source Triton compiler now targets Blackwell's low-precision tensor cores, reducing reliance on hand-written CUDA kernels. Reports and rumors link B100/B200 clusters to the training of several frontier models, including Meta's Llama 4, while Google's Gemini models continue to be trained primarily on TPUs and xAI's Grok-2 predates Blackwell's volume availability.
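For a flavor of what Triton programming looks like, here is a minimal elementwise kernel. It is a generic textbook-style example, not B100- or FP4-specific (low-precision tensor-core code generation happens inside the compiler backend rather than in anything this kernel expresses), and it assumes an NVIDIA GPU with the triton and torch packages installed:

```python
# Minimal Triton kernel: elementwise add, to show the programming model only.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(10_000, device="cuda")
    b = torch.randn(10_000, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```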