Compute & Accelerators
GPUs, TPUs, Trainium, MI300X, wafer-scale Cerebras chips. The silicon doing the actual training and inference, what specs really matter (HBM bandwidth, NVLink reach, FP8/FP4 throughput), and how a single GB200 NVL72 rack delivers 1.4 EFLOPS.
1 · What actually matters in an AI accelerator
Headlines focus on FLOPS. Engineers building real clusters care more about three numbers: HBM capacity, memory bandwidth, and interconnect bandwidth.
FLOPS are easy to advertise but hard to use. MFU (Model FLOPs Utilization) — the fraction of peak FLOPS a training run actually achieves — is usually 30–55% on real workloads; memory bandwidth and interconnect account for most of the gap.
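The MFU arithmetic is simple enough to sketch. This is a minimal estimate using the standard ~6N FLOPs-per-token approximation for dense transformer training; the workload numbers (a 70B model at 1,000 tokens/s per GPU) are hypothetical, and 989 TFLOPS is H100 dense BF16 peak.

```python
def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved training FLOPS / peak hardware FLOPS.

    Uses the common ~6 * N FLOPs-per-token approximation for a dense
    transformer (forward + backward pass), so this is an estimate, not
    a measurement.
    """
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# Hypothetical run: 70B dense model, 1,000 tokens/s per GPU,
# H100 dense BF16 peak of 989 TFLOPS.
print(f"MFU: {mfu(1_000, 70e9, 989e12):.1%}")  # lands in the typical 30-55% band
```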
2 · The accelerator lineup
NVIDIA — the dominant platform
NVIDIA's edge isn't a single chip — it's the integrated stack: chip + NVLink + NVSwitch + InfiniBand (Mellanox) + CUDA + cuDNN + NCCL + TensorRT. Replacing any one piece costs years of software engineering at the customer.
- H100 SXM5 (2022): 80 GB HBM3, 3.35 TB/s, 700 W. Workhorse of the GPT-4 / Llama-3 era.
- H200 (2024): drop-in upgrade, same TDP, but 141 GB HBM3e at 4.8 TB/s — ~75% more memory.
- B200 (2025): Blackwell architecture, 192 GB HBM3e at 8 TB/s, 1000 W TDP.
- GB200: B200 + Grace CPU on the same board. Two B200 + one Grace = "Grace-Blackwell Superchip".
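Why does the H100 → H200 → B200 bandwidth jump matter more than the FLOPS jump? Autoregressive decode is typically memory-bandwidth-bound: every generated token streams the full weight set through HBM once. A simplified lower-bound sketch (single GPU, FP8 weights, KV cache and overlap ignored — all assumptions, not vendor figures):

```python
def min_ms_per_token(n_params: float, bytes_per_param: float,
                     hbm_tbps: float) -> float:
    """Bandwidth-bound floor on decode latency: weight bytes / HBM bandwidth.

    Ignores KV-cache traffic and compute, so real latency is higher;
    the point is the relative scaling across generations.
    """
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_tbps * 1e12) * 1e3  # seconds -> ms

# Hypothetical 70B model in FP8 (1 byte/param) against datasheet bandwidths.
for name, bw in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0)]:
    print(f"{name}: {min_ms_per_token(70e9, 1, bw):.1f} ms/token floor")
```

The floor drops roughly 2.4× from H100 to B200 purely from bandwidth, before any FLOPS or precision gains are counted.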
AMD MI300X — the credible alternative
192 GB HBM3 at 5.3 TB/s, ~750 W. AMD shipped these to Microsoft, Meta, Oracle, and others through 2024–2025. The hardware is competitive; the software stack (ROCm) has closed but still trails CUDA.
Google TPU v5p
Custom Google silicon, only available in Google Cloud. ~459 BF16 TFLOPS, 95 GB HBM2e, 2.8 TB/s. Used internally for Gemini training. TPU pods scale to 8,960 chips with custom optical interconnect.
AWS Trainium2
AWS's second-gen training chip. ~1.3 PFLOPS FP8 (roughly half that at BF16), 96 GB HBM3. Powering Project Rainier — Anthropic's 2.2 GW Amazon-built cluster announced November 2025.
Cerebras WSE-3
Wafer-scale: a single chip the size of a dinner plate (46,225 mm²), 4 trillion transistors, 900,000 cores, 44 GB on-die SRAM at 21 PB/s. Niche but extraordinary for inference at scale.
Source: NVIDIA H100/H200/B200 datasheets; AMD Instinct MI300X spec sheet; Google Cloud TPU v5p documentation; AWS Trainium2 product page; Cerebras WSE-3 announcement.
3 · The GB200 NVL72 rack — the new unit of compute
Before NVL72, scaling beyond 8 GPUs meant going through InfiniBand — much slower than in-node NVLink. With NVL72, NVIDIA put 72 GPUs into one NVLink domain with 1.8 TB/s bidirectional bandwidth per GPU. For models that fit, training looks like running on a single ~14 TB GPU.
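"For models that fit" is checkable on the back of an envelope. A rough sketch, assuming mixed-precision Adam at ~16 bytes per parameter (BF16 weights + FP32 master copy + two FP32 optimizer moments) and ignoring activations — both simplifying assumptions:

```python
def fits_in_nvl72(n_params: float, bytes_per_param: float = 16,
                  rack_hbm_tb: float = 13.8) -> bool:
    """Does the training state fit in one NVL72 NVLink domain?

    Assumes mixed-precision Adam (~16 bytes/param) and 72 x 192 GB
    of HBM per rack; activation memory is ignored for simplicity.
    """
    return n_params * bytes_per_param <= rack_hbm_tb * 1e12

print(fits_in_nvl72(405e9))  # 405B model: ~6.5 TB of state -> fits
print(fits_in_nvl72(1e12))   # 1T model: ~16 TB -> needs sharding beyond one rack
```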
4 · Precisions: FP32 → FP16 → FP8 → FP4
Lower precision = more throughput, less memory, slightly less accuracy. The industry has moved aggressively down the precision curve: FP32 → BF16 mixed precision → FP8 (Hopper's Transformer Engine) → FP4 (Blackwell).
The "1.4 EFLOPS" NVL72 number is FP4. The same rack does 720 PFLOPS at FP8 and 360 PFLOPS at BF16. Always check which precision a vendor is quoting.
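One habit worth automating: normalize every vendor claim to a common precision before comparing. A small sketch using the NVL72 figures above (the halving-per-precision-step pattern is the point, not the exact constants):

```python
# NVL72 dense throughput from the text; each doubling of bit width
# roughly halves peak FLOPS on the same silicon.
nvl72_pflops = {"FP4": 1400, "FP8": 720, "BF16": 360}

def as_bf16_equivalent(pflops: float, precision: str) -> float:
    """Normalize a quoted peak to BF16 terms for apples-to-apples comparison."""
    scale = {"FP4": 4, "FP8": 2, "BF16": 1}
    return pflops / scale[precision]

for prec, pf in nvl72_pflops.items():
    print(f"{prec:>4}: quoted {pf:>4} PFLOPS "
          f"= {as_bf16_equivalent(pf, prec):.0f} PFLOPS BF16-equivalent")
```

Run this on any spec sheet's headline number first; a "2× faster" claim often evaporates once both chips are quoted at the same precision.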
Lesson 04 — TL;DR
- HBM capacity + bandwidth + interconnect matter more than peak FLOPS.
- NVIDIA's moat is the full stack: silicon + NVLink + Mellanox + CUDA.
- AMD MI300X is competitive hardware; software is the gap.
- GB200 NVL72 = 72 GPUs in one NVLink domain; ~14 TB HBM, 1.4 EF FP4.
- Precision matters: always check FP32/BF16/FP8/FP4 when comparing FLOPS claims.