Compute & Accelerators
GPUs, TPUs, Trainium, MI300X, wafer-scale Cerebras chips. The silicon doing the actual training and inference, what specs really matter (HBM bandwidth, NVLink reach, FP8/FP4 throughput), and how a single GB200 NVL72 rack delivers 1.4 EFLOPS.
1 · What actually matters in an AI accelerator
Headlines focus on FLOPS. Engineers building real clusters care more about three other numbers: HBM capacity, HBM bandwidth, and interconnect bandwidth.
FLOPS are easy to advertise but hard to use. MFU (Model FLOPs Utilization) — the fraction of peak FLOPS your training run actually achieves — is usually 30–55% on real workloads. Memory bandwidth and interconnect dominate the rest.
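As a rough illustration of how MFU is computed, the sketch below divides achieved model FLOPs by aggregate peak FLOPS. It assumes the common estimate of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which is not from this lesson. The model size, GPU count, per-GPU peak, and token rate are hypothetical values chosen only to land in the 30–55% range.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Return MFU as a fraction of aggregate peak FLOPS.

    Assumption: ~6 FLOPs per parameter per token (forward + backward)
    for a dense transformer. This is an estimate, not a measured value.
    """
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# Hypothetical run: 70B parameters, 1,024 GPUs, an assumed 1e15 FLOPS
# peak per GPU at the training precision, and 1.0M tokens/s cluster-wide.
mfu = model_flops_utilization(
    n_params=70e9,
    tokens_per_second=1.0e6,
    n_gpus=1024,
    peak_flops_per_gpu=1e15,
)
print(f"MFU = {mfu:.1%}")
```

In practice the token rate comes from the training logs and the peak figure from the vendor datasheet at the precision actually used.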
2 · The accelerator lineup
NVIDIA — the dominant platform
NVIDIA's edge isn't a single chip — it's the integrated stack: chip + NVLink + NVSwitch + InfiniBand (Mellanox) + CUDA + cuDNN + NCCL + TensorRT. Replacing any one piece costs years of software engineering at the customer.
- H100 SXM5 (2022): 80 GB HBM3, 3.35 TB/s, 700 W. Workhorse of the GPT-4 / Llama-3 era.
- H200 (2024): drop-in upgrade, same TDP, but 141 GB HBM3e at 4.8 TB/s — ~75% more memory.
- B200 (2025): Blackwell architecture, 192 GB HBM3e at 8 TB/s, 1000 W TDP.
- GB200: B200 + Grace CPU on the same board. Two B200 + one Grace = "Grace-Blackwell Superchip".
- GB300 NVL72 (Blackwell Ultra, in volume 2026): 288 GB HBM3e per GPU, ~1,400 W, ~1.1 EFLOPS dense FP4 per rack, ConnectX-8 at 800 Gb/s; tuned for reasoning / test-time-compute inference.
- Rubin VR200 NVL144 (partner availability H2 2026): 288 GB HBM4, ~50 PFLOPS FP4 per GPU, new Vera CPU replacing Grace; NVL144 ≈ 3.6 EFLOPS FP4 (~3.3× GB300). Rubin Ultra (2027) targets ~100 PFLOPS + 1 TB HBM4e.
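To see why HBM bandwidth matters as much as capacity, here is a minimal sketch of a memory-bound decode ceiling using the H100, H200, and B200 figures from the list above. It assumes batch-1 generation where every token streams all weights from HBM once, and it ignores KV-cache traffic, batching, and compute limits. The 70 GB weight size is a hypothetical example.

```python
# HBM capacity (GB) and bandwidth (TB/s) as quoted in the list above.
GPUS = {
    "H100 SXM5": {"hbm_gb": 80, "bw_tbs": 3.35},
    "H200": {"hbm_gb": 141, "bw_tbs": 4.8},
    "B200": {"hbm_gb": 192, "bw_tbs": 8.0},
}


def max_decode_tokens_per_s(weights_gb, bw_tbs):
    """Upper bound on batch-1 tokens/s if each token reads all weights once."""
    return (bw_tbs * 1000.0) / weights_gb


WEIGHTS_GB = 70.0  # hypothetical: ~70B parameters at 1 byte each (FP8)

for name, spec in GPUS.items():
    fits = WEIGHTS_GB <= spec["hbm_gb"]
    bound = max_decode_tokens_per_s(WEIGHTS_GB, spec["bw_tbs"])
    print(f"{name}: fits in HBM = {fits}, at most ~{bound:.0f} tokens/s")
```

The bound scales linearly with bandwidth, which is why the jump from 3.35 TB/s to 8 TB/s matters for inference even when the model already fits.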
AMD MI300X — the credible alternative
192 GB HBM3 at 5.3 TB/s, ~750 W. AMD shipped these to Microsoft, Meta, Oracle, and others through 2024–2025. The hardware is competitive; the software stack (ROCm) has narrowed the gap but still trails CUDA. The 2025 refresh MI325X carries 256 GB HBM3e at 6 TB/s (a direct H200 answer), and MI355X (CDNA 4, 288 GB, ~8 TB/s, native FP4/FP6) is AMD's Blackwell-class part. Meta committed to a ~6 GW AMD GPU deal in early 2026.
Google TPU v5p
Custom Google silicon, only available in Google Cloud. ~459 BF16 TFLOPS, 95 GB HBM2e, 2.8 TB/s. Used internally for Gemini training. TPU pods scale to 8,960 chips with custom optical interconnect. The 2025 generation TPU v7 "Ironwood" is inference-focused (~4.6 PFLOPS FP8/chip, pods to 9,216 chips) and underpins Anthropic's up-to-1-million-TPU commitment with Google.
AWS Trainium2
AWS's second-gen training chip. ~1.3 PFLOPS dense FP8, 96 GB HBM3. It powers Project Rainier, Anthropic's 2.2 GW Amazon-built cluster announced November 2025.
Cerebras WSE-3
Wafer-scale: a single chip the size of a dinner plate (46,225 mm²), 4 trillion transistors, 900,000 cores, 44 GB on-die SRAM at 21 PB/s. Niche but extraordinary for inference at scale.
Source: NVIDIA H100/H200/B200 datasheets; AMD Instinct MI300X spec sheet; Google Cloud TPU v5p documentation; AWS Trainium2 product page; Cerebras WSE-3 announcement.
3 · The GB200 NVL72 rack — the new unit of compute
Before NVL72, scaling beyond 8 GPUs meant going through InfiniBand, which is much slower than NVLink inside a node. With NVL72, NVIDIA put 72 GPUs into one NVLink domain with 1.8 TB/s of bidirectional bandwidth per GPU. For models that fit, training looks like running on a single ~13.5 TB GPU.
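The rack-level figures follow from simple multiplication, sketched below with the per-GPU numbers quoted in this lesson. The nominal product comes out slightly above the ~13.5 TB figure; that gap presumably reflects usable versus nominal HBM capacity, which is an assumption here and not something the lesson states.

```python
GPUS_PER_RACK = 72
HBM_PER_GPU_GB = 192        # B200 figure from the accelerator lineup
NVLINK_PER_GPU_TBS = 1.8    # bidirectional NVLink bandwidth per GPU

# Pooled memory and aggregate NVLink bandwidth across the whole domain.
total_hbm_tb = GPUS_PER_RACK * HBM_PER_GPU_GB / 1000
total_nvlink_tbs = GPUS_PER_RACK * NVLINK_PER_GPU_TBS

print(f"Nominal pooled HBM: {total_hbm_tb:.1f} TB")
print(f"Aggregate NVLink bandwidth: {total_nvlink_tbs:.1f} TB/s")
```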
4 · Precisions: FP32 → FP16 → FP8 → FP4
Lower precision = more throughput, less memory, slightly less accuracy. The industry has moved aggressively down the precision curve, from FP32 through FP16/BF16 and FP8 to FP4.
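The "1.4 EFLOPS" NVL72 number is FP4. The same rack does 720 PFLOPS at FP8 and 360 PFLOPS at BF16. These headline figures are quoted with sparsity; dense throughput is roughly half. Always check which precision, and whether sparsity is assumed, when reading a vendor's number.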
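A short sketch of the two effects of dropping precision: per-GPU throughput roughly doubles at each step, and the bytes needed per parameter halve. The rack figures are the ones quoted above. The 400B-parameter model is hypothetical, and the byte counts ignore the small overhead of scaling factors used by block-scaled low-precision formats.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
RACK_PFLOPS = {"BF16": 360, "FP8": 720, "FP4": 1400}  # NVL72 figures quoted above
GPUS_PER_RACK = 72


def weights_gb(n_params, precision):
    """Memory needed for the weights alone, in GB, at the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


N_PARAMS = 400e9  # hypothetical model size

for precision in ("BF16", "FP8", "FP4"):
    per_gpu_pflops = RACK_PFLOPS[precision] / GPUS_PER_RACK
    size_gb = weights_gb(N_PARAMS, precision)
    print(f"{precision}: ~{per_gpu_pflops:.1f} PFLOPS per GPU, weights = {size_gb:.0f} GB")
```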
Lesson 04 — TL;DR
- HBM capacity + bandwidth + interconnect matter more than peak FLOPS.
- NVIDIA's moat is the full stack: silicon + NVLink + Mellanox + CUDA.
- AMD MI300X is competitive hardware; software is the gap.
- GB200 NVL72 = 72 GPUs in one NVLink domain; 13.5 TB HBM, 1.4 EF FP4.
- Precision matters: always check FP32/BF16/FP8/FP4 when comparing FLOPS claims.
References
- NVIDIA H100 Tensor Core GPU Datasheet — 80 GB HBM3, 3.35 TB/s, 700 W TDP
- NVIDIA H200 Product Brief — 141 GB HBM3e, 4.8 TB/s
- NVIDIA Blackwell B200 / GB200 NVL72 Whitepaper — 192 GB HBM3e, 8 TB/s, 1000 W TDP
- AMD Instinct MI300X Datasheet — 192 GB HBM3, 5.3 TB/s, 750 W
- Google Cloud TPU v5p Documentation — 95 GB HBM, 2.8 TB/s
- AWS Trainium2 Overview — UltraServer interconnect