Lesson 04/12 · Intermediate · 18 min read · 5 diagrams

Compute & Accelerators

GPUs, TPUs, Trainium, MI300X, wafer-scale Cerebras chips. The silicon doing the actual training and inference, what specs really matter (HBM bandwidth, NVLink reach, FP8/FP4 throughput), and how a single GB200 NVL72 rack delivers 1.4 EFLOPS.

1 · What actually matters in an AI accelerator

Headlines focus on FLOPS. Engineers building real clusters care more about three numbers:

  • HBM capacity (GB): how big a model fits on each device
  • HBM bandwidth (TB/s): the inference latency floor
  • Interconnect bandwidth (GB/s): multi-GPU scaling

FLOPS are easy to advertise but hard to use. MFU (Model FLOPs Utilization) — the fraction of peak FLOPS your training run actually achieves — is usually 30–55% on real workloads. Memory bandwidth and interconnect dominate the rest.
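The MFU arithmetic can be sketched in a few lines, using the common ~6 × params approximation for training FLOPs per token (forward plus backward). The token rate and the ~989 TFLOPS dense BF16 peak per H100 below are illustrative assumptions, not measurements.

```python
def mfu(params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved training FLOP/s over advertised peak."""
    achieved = 6 * params * tokens_per_sec   # ~6*N FLOPs per token (fwd + bwd)
    peak = n_gpus * peak_flops_per_gpu       # what the datasheets advertise
    return achieved / peak

# Hypothetical 70B-parameter run at 1M tokens/s on 1,024 H100s,
# assuming ~989 TFLOPS dense BF16 peak per GPU:
print(f"MFU = {mfu(70e9, 1.0e6, 1024, 989e12):.0%}")   # lands in the 30-55% band
```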

2 · The accelerator lineup

Frontier AI accelerators (per chip):

  Accelerator   Vendor   Released   HBM            Bandwidth   TDP
  H100 SXM5     NVIDIA   2022       80 GB HBM3     3.35 TB/s   700 W
  H200          NVIDIA   2024       141 GB HBM3e   4.8 TB/s    700 W
  B200          NVIDIA   2025       192 GB HBM3e   8.0 TB/s    1000 W
  MI300X        AMD      2023       192 GB HBM3    5.3 TB/s    750 W
  TPU v5p       Google   2023       95 GB HBM2e    2.8 TB/s    —
  Trainium2     AWS      2024       96 GB HBM3     2.9 TB/s    —

Sources: NVIDIA H100/H200/B200 datasheets, AMD MI300X datasheet, Google Cloud TPU v5p docs, AWS Trainium2 announcement.
Per-chip specifications. Real cluster economics also factor interconnect, software maturity, and supply.
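HBM capacity turns directly into a "does it fit?" calculation. A weights-only sketch per precision, using capacities from the table above; it ignores KV cache, activations, and framework overhead, so treat the answer as optimistic.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weights_gb(params: float, precision: str) -> float:
    """Memory footprint of the weights alone, in GB."""
    return params * BYTES_PER_PARAM[precision] / 1e9

def min_devices(params: float, precision: str, hbm_gb: float) -> int:
    """Minimum device count just to hold the weights."""
    return math.ceil(weights_gb(params, precision) / hbm_gb)

# A 405B-parameter model in FP16 is ~810 GB of weights:
print(min_devices(405e9, "fp16", 80))    # H100 (80 GB)    -> 11 GPUs
print(min_devices(405e9, "fp16", 192))   # MI300X (192 GB) -> 5 GPUs
```

This is why per-device capacity, not FLOPS, usually decides the minimum cluster shape for serving a given model.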

NVIDIA — the dominant platform

NVIDIA's edge isn't a single chip — it's the integrated stack: chip + NVLink + NVSwitch + InfiniBand (Mellanox) + CUDA + cuDNN + NCCL + TensorRT. Replacing any one piece costs customers years of software engineering.

  • H100 SXM5 (2022): 80 GB HBM3, 3.35 TB/s, 700 W. Workhorse of the GPT-4 / Llama-3 era.
  • H200 (2024): drop-in upgrade, same TDP, but 141 GB HBM3e at 4.8 TB/s — ~75% more memory.
  • B200 (2025): Blackwell architecture, 192 GB HBM3e at 8 TB/s, 1000 W TDP.
  • GB200: Blackwell GPUs paired with a Grace CPU on one board — two B200s plus one Grace form the "Grace Blackwell Superchip".

AMD MI300X — the credible alternative

192 GB HBM3 at 5.3 TB/s, ~750 W. AMD shipped these to Microsoft, Meta, Oracle, and others through 2024–2025. The hardware is competitive; the software stack (ROCm) has closed much of the gap but still trails CUDA.

Google TPU v5p

Custom Google silicon, only available in Google Cloud. ~459 BF16 TFLOPS, 95 GB HBM2e, 2.8 TB/s. Used internally for Gemini training. TPU pods scale to 8,960 chips with custom optical interconnect.

AWS Trainium2

AWS's second-gen training chip. ~1.3 PFLOPS BF16, 96 GB HBM3. Powering Project Rainier — Anthropic's 2.2 GW Amazon-built cluster announced November 2025.

Cerebras WSE-3

Wafer-scale: a single chip the size of a dinner plate (46,225 mm²), 4 trillion transistors, 900,000 cores, 44 GB on-die SRAM at 21 PB/s. Niche but extraordinary for inference at scale.

Source: NVIDIA H100/H200/B200 datasheets; AMD Instinct MI300X spec sheet; Google Cloud TPU v5p documentation; AWS Trainium2 product page; Cerebras WSE-3 announcement.

3 · The GB200 NVL72 rack — the new unit of compute

[Diagram: GB200 NVL72 — one rack, one GPU domain. 18 compute trays × 4 B200 = 72 GPUs; 9 NVSwitch trays of 5th-gen NVSwitch at 1.8 TB/s per GPU; 36 Grace (ARM) CPUs; CDU for liquid coolant; 415 V power shelf; ~120 kW total rack power; 1.4 EFLOPS FP4 inference; 14 TB total HBM3e.]
72 B200 GPUs + 36 Grace CPUs in a single liquid-cooled rack. NVLink 5 makes them appear as one giant GPU domain.

Before NVL72, scaling beyond 8 GPUs meant going through InfiniBand, which is much slower than intra-node NVLink. With NVL72, NVIDIA put 72 GPUs into one NVLink domain with 1.8 TB/s bidirectional bandwidth per GPU. For models that fit, training looks like running on a single ~14 TB GPU.
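The gap shows up directly in gradient synchronization. A ring all-reduce over n devices moves roughly 2·(n−1)/n × the message size per device; timing that against link bandwidth illustrates the NVLink-versus-InfiniBand difference. The bandwidth figures below are simplifying assumptions (1.8 TB/s bidirectional ≈ 900 GB/s per direction; ~50 GB/s for an InfiniBand NIC), not measured numbers.

```python
def allreduce_seconds(grad_bytes: float, n: int, link_gb_per_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce; ignores latency and overlap."""
    traffic = 2 * (n - 1) / n * grad_bytes        # bytes each device transfers
    return traffic / (link_gb_per_s * 1e9)

grads = 70e9 * 2   # 70B params in BF16 -> 140 GB of gradients
print(f"NVLink 5 (~900 GB/s/dir): {allreduce_seconds(grads, 72, 900):.2f} s")
print(f"InfiniBand (~50 GB/s):    {allreduce_seconds(grads, 72, 50):.2f} s")
```

Real runs overlap communication with compute and shard gradients, but the per-step budget still scales with link bandwidth in the same way.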

  • GPUs: 72 × B200, liquid cooled
  • CPUs: 36 × Grace (ARM, NVLink-attached)
  • HBM total: 14 TB HBM3e
  • Compute: 1.4 EFLOPS FP4, 720 PFLOPS FP8
  • NVLink bandwidth: 1.8 TB/s per GPU, bidirectional
  • Power: ~120 kW per rack

4 · Precisions: FP32 → FP16 → FP8 → FP4

Lower precision = more throughput, less memory, slightly less accuracy. The industry has moved aggressively down the precision curve:

  • FP32 (full precision): legacy training, rare now
  • BF16 / FP16 (half): standard training
  • FP8 (quarter): H100+ training, frontier inference
  • FP4 (eighth): Blackwell inference, MXFP4 format
  • INT8 (quarter): quantized inference
  • INT4 (eighth): aggressive quantization

The "1.4 EFLOPS" NVL72 number is FP4. The same rack does 720 PFLOPS at FP8 and 360 PFLOPS at BF16. Always check which precision a vendor is quoting.
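Precision also moves the inference latency floor from section 1: during single-stream decode, every generated token must stream the full weights from HBM once, so tokens/sec is capped by bandwidth divided by weight bytes. A weights-only sketch under that assumption (no KV cache, batching, or overlap):

```python
BYTES = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def decode_ceiling(params: float, precision: str, hbm_tb_per_s: float) -> float:
    """Upper bound on single-stream tokens/sec from HBM bandwidth alone."""
    return hbm_tb_per_s * 1e12 / (params * BYTES[precision])

# 70B-parameter model on one B200 (8 TB/s HBM3e):
for p in ("bf16", "fp8", "fp4"):
    print(p, round(decode_ceiling(70e9, p, 8.0)), "tok/s")
```

Halving the bytes per weight doubles the ceiling, which is much of why the industry keeps sliding down the precision curve.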

Lesson 04 — TL;DR

  • HBM capacity + bandwidth + interconnect matter more than peak FLOPS.
  • NVIDIA's moat is the full stack: silicon + NVLink + Mellanox + CUDA.
  • AMD MI300X is competitive hardware; software is the gap.
  • GB200 NVL72 = 72 GPUs in one NVLink domain; 14 TB HBM, 1.4 EF FP4.
  • Precision matters: always check FP32/BF16/FP8/FP4 when comparing FLOPS claims.
