Network Fabric
Training a frontier model means thousands of GPUs exchanging terabytes of gradients every training step. Standard Ethernet doesn't cut it. This lesson covers InfiniBand vs Ethernet, NVLink scale-up vs scale-out, CLOS topology, and rail-optimized layouts.
1 · Scale-up vs scale-out
Two completely different network problems live in an AI cluster:
Scale-up (within a node)
8–72 GPUs that need to act like one big GPU. Sub-microsecond latency, terabytes per second of bandwidth. Solution: NVLink + NVSwitch (or AMD Infinity Fabric, Google ICI).
Scale-out (across nodes)
Connect 100s–10,000s of nodes into one training job. Single-digit microsecond latency, hundreds of Gbps per port. Solution: InfiniBand or RoCE/Ultra Ethernet.
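To see why the two tiers are such different problems, a back-of-envelope ring all-reduce estimate helps. Every figure below — bandwidths, hop latencies, gradient size — is an assumption chosen for illustration, not a measurement from any real cluster:

```python
# Back-of-envelope ring all-reduce time for one gradient sync.
# All bandwidth/latency/model-size numbers are illustrative assumptions.

def ring_allreduce_seconds(data_bytes, n_devices, link_bytes_per_s, hop_latency_s):
    """Classic ring all-reduce: each device moves 2*(N-1)/N of the data,
    plus 2*(N-1) latency-bound steps."""
    transfer = 2 * (n_devices - 1) / n_devices * data_bytes / link_bytes_per_s
    latency = 2 * (n_devices - 1) * hop_latency_s
    return transfer + latency

grad_bytes = 2 * 70e9  # 70B parameters in bf16 (assumed example size)

# Scale-up: 8 GPUs over NVLink-class links (~450 GB/s usable, ~1 us hops -- assumed)
t_up = ring_allreduce_seconds(grad_bytes, 8, 450e9, 1e-6)

# Scale-out: 1024 nodes over 400 Gb/s IB-class links (~50 GB/s usable, ~5 us hops -- assumed)
t_out = ring_allreduce_seconds(grad_bytes, 1024, 50e9, 5e-6)

print(f"scale-up  (8 GPUs, NVLink-class): {t_up * 1e3:7.1f} ms")
print(f"scale-out (1024 nodes, IB-class): {t_out * 1e3:7.1f} ms")
```

Note how the scale-up case is almost purely bandwidth-bound, while at 1024 participants the 2·(N−1) latency term starts to matter — which is why scale-out fabrics obsess over per-hop latency, not just port speed.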
2 · Speeds today
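For orientation, here are representative per-link speeds as commonly cited for the generations named. Treat these as ballpark figures to compare orders of magnitude, not spec-sheet values:

```python
# Representative link speeds (ballpark figures -- verify against vendor specs).
# NVLink figures are per-GPU aggregate across all links; network figures are per-port.
speeds_gbps = {
    "NVLink 4 (Hopper, per GPU)":    900 * 8,   # ~900 GB/s -> Gb/s
    "NVLink 5 (Blackwell, per GPU)": 1800 * 8,  # ~1.8 TB/s
    "InfiniBand NDR (per port)":     400,
    "InfiniBand XDR (per port)":     800,
    "Ultra Ethernet (per port)":     800,
    "PCIe 5.0 x16 (per slot)":       64 * 8,    # ~64 GB/s
}

for name, gbps in sorted(speeds_gbps.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {gbps:>7,d} Gb/s")
```

The gap is the point: a GPU's scale-up fabric moves an order of magnitude more data per second than its scale-out NIC, which is why parallelism strategies try to keep the chattiest traffic inside the node.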
3 · InfiniBand vs Ultra Ethernet
InfiniBand (now NVIDIA-owned via Mellanox) was purpose-built for HPC and dominates AI training fabrics. It offers lossless credit-based flow control, hardware-offloaded RDMA, and tight integration with NCCL (NVIDIA's collective communications library).
Ultra Ethernet is the response of an industry consortium formed in 2023 (AMD, Broadcom, Cisco, Meta, Microsoft, Oracle, others) trying to bring InfiniBand-class semantics to standard Ethernet — open, multi-vendor, commodity optics. The UEC 1.0 specification was released in 2025, with production deployments starting in 2025–2026.
4 · CLOS / fat-tree topology
CLOS networks (named after Charles Clos, 1953) provide non-blocking bandwidth: any leaf can talk to any other leaf at full speed, with no path conflicts. The cost is many switch ports — a 32k-GPU cluster needs ~3,000 switches.
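The switch count follows from the classic fat-tree formulas: with radix-k switches, a full three-tier fat-tree supports k³/4 hosts using 5k²/4 switches. A quick sketch (radix-64 is an assumption; real clusters mix radices and oversubscription):

```python
# Sizing a full three-tier k-ary fat-tree built from k-port switches.
# Classic result: k**3/4 hosts, 5*k**2/4 switches (k even).

def fat_tree(k: int):
    hosts = k**3 // 4
    edge = k * (k // 2)       # k pods, k/2 leaf (edge) switches each
    agg = k * (k // 2)        # k pods, k/2 aggregation switches each
    core = (k // 2) ** 2      # core layer
    return hosts, edge + agg + core

hosts, switches = fat_tree(64)
print(f"radix-64 fat-tree: {hosts:,} hosts, {switches:,} switches")
```

A full radix-64 build supports 65,536 hosts with 5,120 switches; a fabric right-sized for ~32k GPU endpoints lands in the low thousands of switches, consistent with the ~3,000 figure above.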
Rail-optimized topology
For tensor-parallel and pipeline-parallel training, NVIDIA recommends "rail-optimized" topology: every server's nth GPU port goes to the nth rail (a separate physical CLOS network). This keeps gradient sync within a single rail, dramatically reducing tail latency.
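The cabling rule itself is simple enough to sketch. The mapping below is a toy illustration of the idea — NIC n on every server lands on rail n — not any vendor's configuration format:

```python
# Rail-optimized cabling sketch: GPU/NIC n of every server plugs into
# rail n, a physically separate leaf/spine network. Names are illustrative.

NUM_RAILS = 8  # one NIC per GPU in an 8-GPU server (assumed layout)

def rail_for(gpu_index: int) -> int:
    """NIC of GPU n on every server cables into rail n."""
    return gpu_index % NUM_RAILS

# GPU 3 on every server shares rail 3, so a collective among the
# "GPU 3"s of many servers never has to cross rails:
peers = [(f"srv-{s:02d}", 3) for s in range(4)]
rails_used = {rail_for(gpu) for _, gpu in peers}
print(rails_used)  # a single rail serves all peers
```

Because each rank talks only to same-index GPUs on other servers, its traffic stays on one rail's leaf/spine tree and never competes with the other seven rails — that isolation is where the tail-latency win comes from.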
5 · Optics — what's actually plugged in
Above 100G, you're using pluggable optical transceivers: QSFP56, QSFP-DD, OSFP. Each is a small module with lasers and photodetectors that converts electrical signals to light. At 800G a single module can cost $1,000+ and draws on the order of 15 W, much of it in the retiming DSP; a 32k-GPU cluster needs tens of thousands. Two emerging fixes: LPO (linear pluggable optics) drops the DSP to cut power, and CPO (co-packaged optics) moves the optical engine onto the switch package itself.
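The aggregate bill is worth a quick estimate. Every unit cost, power figure, and link count below is an assumption for illustration — real fabrics mix copper DACs for short runs and vary in oversubscription:

```python
# Rough optics bill for a three-tier fabric: every optical link needs a
# transceiver at each end. All unit figures are illustrative assumptions.

GPUS = 32_768
LINKS_PER_GPU = 1   # one 800G NIC port per GPU (assumed)
TIERS = 3           # NIC->leaf, leaf->spine, spine->core (assumed all-optical)
COST_USD = 1_000    # per 800G module (assumed)
POWER_W = 15        # per module (assumed)

# A non-blocking fabric carries roughly GPUS links per tier,
# and each link needs two modules (one per end).
modules = GPUS * LINKS_PER_GPU * TIERS * 2
print(f"modules: {modules:,}")
print(f"cost:    ${modules * COST_USD / 1e6:.0f}M")
print(f"power:   {modules * POWER_W / 1e3:.0f} kW")
```

Even with generous rounding, optics land in the hundreds of millions of dollars and megawatts of draw at this scale — which is the economic pressure behind LPO and CPO.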
Lesson 05 — TL;DR
- Two networks: scale-up (NVLink) within rack, scale-out (IB/Ethernet) across racks.
- InfiniBand at 400/800 Gbps is the dominant scale-out fabric for AI today.
- Ultra Ethernet is the multi-vendor open challenger — first deployments in 2025–2026.
- CLOS = non-blocking topology; rail-optimized layouts isolate gradient sync to one rail.
- Optics are expensive and power-hungry; LPO/CPO are the near-term answer.