Lesson 04/12 · Intermediate · 18 min read · 5 diagrams

Compute & Accelerators

GPUs, TPUs, Trainium, MI300X, wafer-scale Cerebras chips. The silicon doing the actual training and inference, what specs really matter (HBM bandwidth, NVLink reach, FP8/FP4 throughput), and how a single GB200 NVL72 rack delivers 1.4 EFLOPS.

1 · What actually matters in an AI accelerator

Headlines focus on FLOPS. Engineers building real clusters care more about three numbers:

  • HBM capacity (GB): how big a model fits on each device
  • HBM bandwidth (TB/s): the inference latency floor
  • Interconnect bandwidth (GB/s): multi-GPU scaling

FLOPS are easy to advertise but hard to use. MFU (Model FLOPs Utilization) — the fraction of peak FLOPS your training run actually achieves — is usually 30–55% on real workloads. Memory bandwidth and interconnect dominate the rest.
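The MFU arithmetic can be sketched in a few lines, using the common ~6 × params approximation for training FLOPs per token (forward plus backward). The token rate and the ~989 TFLOPS dense BF16 peak per H100 below are illustrative assumptions, not measurements.

```python
def mfu(params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved training FLOP/s over advertised peak."""
    achieved = 6 * params * tokens_per_sec   # ~6*N FLOPs per token (fwd + bwd)
    peak = n_gpus * peak_flops_per_gpu       # what the datasheets advertise
    return achieved / peak

# Hypothetical 70B-parameter run at 1M tokens/s on 1,024 H100s,
# assuming ~989 TFLOPS dense BF16 peak per GPU:
print(f"MFU = {mfu(70e9, 1.0e6, 1024, 989e12):.0%}")   # lands in the 30-55% band
```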

2 · The accelerator lineup

Frontier AI accelerators (per chip):

  Accelerator   Vendor   Released   HBM            Bandwidth   TDP
  H100 SXM5     NVIDIA   2022       80 GB HBM3     3.35 TB/s   700 W
  H200          NVIDIA   2024       141 GB HBM3e   4.8 TB/s    700 W
  B200          NVIDIA   2025       192 GB HBM3e   8.0 TB/s    1000 W
  MI300X        AMD      2023       192 GB HBM3    5.3 TB/s    750 W
  TPU v5p       Google   2023       95 GB HBM2e    2.8 TB/s    —
  Trainium2     AWS      2024       96 GB HBM3     2.9 TB/s    —

Sources: NVIDIA H100/H200/B200 datasheets, AMD MI300X datasheet, Google Cloud TPU v5p docs, AWS Trainium2 announcement.
Per-chip specifications. Real cluster economics also factor interconnect, software maturity, and supply.
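HBM capacity turns directly into a "does it fit?" calculation. A weights-only sketch per precision, using capacities from the table above; it ignores KV cache, activations, and framework overhead, so treat the answer as optimistic.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weights_gb(params: float, precision: str) -> float:
    """Memory footprint of the weights alone, in GB."""
    return params * BYTES_PER_PARAM[precision] / 1e9

def min_devices(params: float, precision: str, hbm_gb: float) -> int:
    """Minimum device count just to hold the weights."""
    return math.ceil(weights_gb(params, precision) / hbm_gb)

# A 405B-parameter model in FP16 is ~810 GB of weights:
print(min_devices(405e9, "fp16", 80))    # H100 (80 GB)    -> 11 GPUs
print(min_devices(405e9, "fp16", 192))   # MI300X (192 GB) -> 5 GPUs
```

This is why per-device capacity, not FLOPS, usually decides the minimum cluster shape for serving a given model.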

NVIDIA — the dominant platform

NVIDIA's edge isn't a single chip — it's the integrated stack: chip + NVLink + NVSwitch + InfiniBand (Mellanox) + CUDA + cuDNN + NCCL + TensorRT. Replacing any one piece costs customers years of software engineering.

  • H100 SXM5 (2022): 80 GB HBM3, 3.35 TB/s, 700 W. Workhorse of the GPT-4 / Llama-3 era.
  • H200 (2024): drop-in upgrade, same TDP, but 141 GB HBM3e at 4.8 TB/s — ~75% more memory.
  • B200 (2025): Blackwell architecture, 192 GB HBM3e at 8 TB/s, 1000 W TDP.
  • GB200: Blackwell GPUs paired with a Grace CPU on one board — two B200s plus one Grace form the "Grace Blackwell Superchip".

AMD MI300X — the credible alternative

192 GB HBM3 at 5.3 TB/s, ~750 W. AMD shipped these to Microsoft, Meta, Oracle, and others through 2024–2025. The hardware is competitive; the software stack (ROCm) has closed much of the gap but still trails CUDA.

Google TPU v5p

Custom Google silicon, only available in Google Cloud. ~459 BF16 TFLOPS, 95 GB HBM2e, 2.8 TB/s. Used internally for Gemini training. TPU pods scale to 8,960 chips with custom optical interconnect.

AWS Trainium2

AWS's second-gen training chip. ~1.3 PFLOPS BF16, 96 GB HBM3. Powering Project Rainier — Anthropic's 2.2 GW Amazon-built cluster announced November 2025.

Cerebras WSE-3

Wafer-scale: a single chip the size of a dinner plate (46,225 mm²), 4 trillion transistors, 900,000 cores, 44 GB on-die SRAM at 21 PB/s. Niche but extraordinary for inference at scale.

Source: NVIDIA H100/H200/B200 datasheets; AMD Instinct MI300X spec sheet; Google Cloud TPU v5p documentation; AWS Trainium2 product page; Cerebras WSE-3 announcement.

3 · The GB200 NVL72 rack — the new unit of compute

[Diagram: GB200 NVL72 — one rack, one GPU domain. 18 compute trays × 4 B200 = 72 GPUs; 9 NVSwitch trays of 5th-gen NVSwitch at 1.8 TB/s per GPU; 36 Grace (ARM) CPUs; CDU for liquid coolant; 415 V power shelf; ~120 kW total rack power; 1.4 EFLOPS FP4 inference; 14 TB total HBM3e.]
72 B200 GPUs + 36 Grace CPUs in a single liquid-cooled rack. NVLink 5 makes them appear as one giant GPU domain.

Before NVL72, scaling beyond 8 GPUs meant going through InfiniBand, which is much slower than intra-node NVLink. With NVL72, NVIDIA put 72 GPUs into one NVLink domain with 1.8 TB/s bidirectional bandwidth per GPU. For models that fit, training looks like running on a single ~14 TB GPU.
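The gap shows up directly in gradient synchronization. A ring all-reduce over n devices moves roughly 2·(n−1)/n × the message size per device; timing that against link bandwidth illustrates the NVLink-versus-InfiniBand difference. The bandwidth figures below are simplifying assumptions (1.8 TB/s bidirectional ≈ 900 GB/s per direction; ~50 GB/s for an InfiniBand NIC), not measured numbers.

```python
def allreduce_seconds(grad_bytes: float, n: int, link_gb_per_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce; ignores latency and overlap."""
    traffic = 2 * (n - 1) / n * grad_bytes        # bytes each device transfers
    return traffic / (link_gb_per_s * 1e9)

grads = 70e9 * 2   # 70B params in BF16 -> 140 GB of gradients
print(f"NVLink 5 (~900 GB/s/dir): {allreduce_seconds(grads, 72, 900):.2f} s")
print(f"InfiniBand (~50 GB/s):    {allreduce_seconds(grads, 72, 50):.2f} s")
```

Real runs overlap communication with compute and shard gradients, but the per-step budget still scales with link bandwidth in the same way.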

  • GPUs: 72 × B200, liquid cooled
  • CPUs: 36 × Grace (ARM, NVLink-attached)
  • HBM total: 14 TB HBM3e
  • Compute: 1.4 EFLOPS FP4, 720 PFLOPS FP8
  • NVLink bandwidth: 1.8 TB/s per GPU, bidirectional
  • Power: ~120 kW per rack

4 · Precisions: FP32 → FP16 → FP8 → FP4

Lower precision = more throughput, less memory, slightly less accuracy. The industry has moved aggressively down the precision curve:

  • FP32 (full precision): legacy training, rare now
  • BF16 / FP16 (half): standard training
  • FP8 (quarter): H100+ training, frontier inference
  • FP4 (eighth): Blackwell inference, MXFP4 format
  • INT8 (quarter): quantized inference
  • INT4 (eighth): aggressive quantization

The "1.4 EFLOPS" NVL72 number is FP4. The same rack does 720 PFLOPS at FP8 and 360 PFLOPS at BF16. Always check which precision a vendor is quoting.
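Precision also moves the inference latency floor from section 1: during single-stream decode, every generated token must stream the full weights from HBM once, so tokens/sec is capped by bandwidth divided by weight bytes. A weights-only sketch under that assumption (no KV cache, batching, or overlap):

```python
BYTES = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def decode_ceiling(params: float, precision: str, hbm_tb_per_s: float) -> float:
    """Upper bound on single-stream tokens/sec from HBM bandwidth alone."""
    return hbm_tb_per_s * 1e12 / (params * BYTES[precision])

# 70B-parameter model on one B200 (8 TB/s HBM3e):
for p in ("bf16", "fp8", "fp4"):
    print(p, round(decode_ceiling(70e9, p, 8.0)), "tok/s")
```

Halving the bytes per weight doubles the ceiling, which is much of why the industry keeps sliding down the precision curve.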

Lesson 04 — TL;DR

  • HBM capacity + bandwidth + interconnect matter more than peak FLOPS.
  • NVIDIA's moat is the full stack: silicon + NVLink + Mellanox + CUDA.
  • AMD MI300X is competitive hardware; software is the gap.
  • GB200 NVL72 = 72 GPUs in one NVLink domain; 14 TB HBM, 1.4 EF FP4.
  • Precision matters: always check FP32/BF16/FP8/FP4 when comparing FLOPS claims.
