Lesson 05/12 · Intermediate · 16 min read

Network Fabric

Training a frontier model means thousands of GPUs synchronizing gradients many times per second. Standard Ethernet doesn't cut it. This lesson covers InfiniBand vs Ethernet, NVLink scale-up vs scale-out, CLOS topology, and rail-optimized layouts.

1 · Scale-up vs scale-out

Two completely different network problems live in an AI cluster:

Scale-up (within a node)

8–72 GPUs that need to act like one big GPU. Microsecond latency, terabit bandwidth. Solution: NVLink + NVSwitch (or AMD Infinity Fabric, Google ICI).

Scale-out (across nodes)

Connect 100s–10,000s of nodes into one training job. Single-digit microsecond latency, hundreds of Gbps per port. Solution: InfiniBand or RoCE/Ultra Ethernet.
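To get a feel for the gap, here is a rough sketch of one gradient all-reduce, assuming a 70B-parameter model in fp16, NVLink 5 inside the node, and a single 400 Gb/s port per GPU across nodes. All constants are illustrative, not from this lesson:

    # Back-of-the-envelope ring all-reduce time for one full set of fp16 gradients.
    # Every number here is an assumption for illustration, not a measurement.

    def allreduce_seconds(grad_bytes, n_gpus, bytes_per_s):
        # A ring all-reduce pushes ~2*(N-1)/N of the payload through each link.
        traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
        return traffic / bytes_per_s

    grad_bytes = 70e9 * 2          # 70B params, 2 bytes each (fp16)
    nvlink     = 900e9             # ~900 GB/s per direction per GPU (NVLink 5, assumed)
    fabric     = 400e9 / 8         # one 400 Gb/s port = 50 GB/s

    print(f"scale-up,  8 GPUs over NVLink:    {allreduce_seconds(grad_bytes, 8, nvlink):.2f} s")
    print(f"scale-out, 1024 GPUs over fabric: {allreduce_seconds(grad_bytes, 1024, fabric):.2f} s")

Real jobs overlap communication with compute and use hierarchical collectives, so treat this only as a sense of scale: the cross-node hop is the one you have to engineer around.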

2 · Speeds today

Network speeds, per port for the fabric and per GPU (bidirectional) for NVLink:

  • 10 GbE · 10 Gb/s · 2002
  • 100 GbE · 100 Gb/s · 2010
  • 400 GbE / NDR InfiniBand · 400 Gb/s · 2017 (Ethernet) / 2022 (NDR, Mellanox/NVIDIA)
  • 800 GbE / XDR InfiniBand · 800 Gb/s · 2024 deployments; UEC roadmap on the Ethernet side
  • GDR InfiniBand (planned) · 1.6 Tb/s · roadmap
  • NVLink 5 · 1.8 TB/s bidirectional per GPU on Blackwell · 2024

NVLink is scale-up (within the rack); InfiniBand/Ethernet is scale-out (across racks). Ports are the wires; a GPU may attach to the fabric with multiple ports.
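Because the table mixes Gb/s (network convention, bits) and TB/s (GPU convention, bytes), a quick conversion to one unit makes the rows comparable. The snippet below is just that arithmetic, using the figures above:

    # Put the table's numbers in one unit (GB/s, i.e. bytes) before comparing.
    def gbit_to_gbyte(gbit_per_s):
        return gbit_per_s / 8                  # 8 bits per byte

    ndr = gbit_to_gbyte(400)                   # NDR InfiniBand port -> 50 GB/s
    xdr = gbit_to_gbyte(800)                   # XDR InfiniBand port -> 100 GB/s
    nvlink5 = 1800                             # NVLink 5: 1.8 TB/s bidirectional per GPU

    print(f"NDR port {ndr:.0f} GB/s | XDR port {xdr:.0f} GB/s | NVLink 5 {nvlink5} GB/s")
    print(f"NVLink 5 is ~{nvlink5 / xdr:.0f}x one XDR port")   # ~18x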

3 · InfiniBand vs Ultra Ethernet

InfiniBand (now NVIDIA-owned via Mellanox) was purpose-built for HPC and dominates AI training fabrics. It offers lossless flow control, hardware-offloaded RDMA, and tight integration with NCCL (NVIDIA's collective communications library).
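In practice that NCCL integration is steered through environment variables set before the job starts. The variables below are documented NCCL knobs, but the adapter and interface names are placeholders for whatever your cluster exposes; a minimal sketch:

    import os

    # Pick the fabric before NCCL initializes; device names are placeholders.
    os.environ["NCCL_IB_HCA"] = "mlx5"           # use the Mellanox/NVIDIA IB adapters
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # interface for bootstrap / TCP fallback
    os.environ["NCCL_DEBUG"] = "INFO"            # logs which transport NCCL actually chose

    # Force plain TCP instead of RDMA, e.g. to measure what the fabric is buying you:
    # os.environ["NCCL_IB_DISABLE"] = "1"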

Ultra Ethernet is a 2023 industry consortium response (AMD, Broadcom, Cisco, Meta, Microsoft, Oracle, others) trying to bring InfiniBand-class semantics to standard Ethernet — open, multi-vendor, commodity optics. The UEC 1.0 specification was released in 2025, with the first production deployments expected in 2025–2026.

4 · CLOS / fat-tree topology

Diagram: Fat-tree (CLOS) topology with four spine switches above eight leaf switches, each leaf serving a server group; every leaf reaches every other leaf in at most four hops. Two tiers are shown for clarity; real AI fabrics add a third (super-spine) tier for 8k–100k+ GPUs. Bandwidth is non-blocking: any GPU can talk to any other at line rate.

CLOS networks (named after Charles Clos, 1953) provide non-blocking bandwidth: any leaf can talk to any other leaf at full speed, with no path conflicts. The cost is many switch ports — a 32k-GPU cluster needs ~3,000 switches.
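That switch count falls out of standard fat-tree arithmetic. A minimal sketch, assuming radix-64 switches and a classic non-blocking three-tier fat-tree (real builds oversubscribe and trim tiers, so exact counts vary):

    # Classic 3-tier fat-tree built from k-port switches (k even):
    #   hosts    = k^3 / 4
    #   switches = 5 * k^2 / 4   (k^2/4 core + k^2/2 aggregation + k^2/2 edge)
    def fat_tree(k):
        return k**3 // 4, 5 * k**2 // 4

    hosts, switches = fat_tree(64)               # radix-64 switches (assumption)
    print(f"{hosts:,} hosts max, {switches:,} switches")
    # 65,536 hosts max, 5,120 switches -> a ~32k-GPU build lands near ~2,500-3,000 switches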

Rail-optimized topology

For tensor-parallel and pipeline-parallel training, NVIDIA recommends "rail-optimized" topology: every server's nth GPU port goes to the nth rail (a separate physical CLOS network). This keeps gradient sync within a single rail, dramatically reducing tail latency.
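A toy sketch of that wiring rule, with a made-up rail count, just to show that the rail a GPU lands on depends only on its local index:

    # Toy rail-optimized wiring: GPU n on every server plugs into rail n.
    N_RAILS = 8                                   # one rail per local GPU index (assumption)

    def rail_for(server_id, gpu_index):
        # The rail depends only on the GPU's local index, never on the server,
        # so same-index GPUs across the cluster share a rail (one leaf hop apart).
        return gpu_index % N_RAILS

    assert rail_for(server_id=17, gpu_index=3) == rail_for(server_id=512, gpu_index=3)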

5 · Optics — what's actually plugged in

Above 100G, you're using pluggable optical transceivers: QSFP56, QSFP-DD, OSFP. Each is a small fiber-laser module that converts electrical signals to light. At 800G a single QSFP-DD module can cost $1,000+; a 32k-GPU cluster needs tens of thousands.

Cabling options:

  • DAC (passive copper) · ≤3 m · cheap, in-rack
  • AOC (active optical cable) · ≤30 m · pre-terminated fiber
  • Transceiver + fiber · any distance · modular, replaceable
  • LPO (linear pluggable optics) · emerging · lower power, lower latency
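As a rough lower bound on the optics bill, assume the host tier stays on copper DAC (per the list above) and each leaf-to-spine link needs a module at both ends; the per-module price is the figure quoted above, and a third tier roughly doubles the total:

    # Lower-bound optics bill: count only leaf-to-spine links; each optical link
    # needs a pluggable module at both ends. All constants are assumptions.
    gpus            = 32_000
    modules_per_gpu = 2                 # one leaf-side + one spine-side module per uplink
    cost_per_module = 1_000             # USD, 800G-class pluggable (assumption)

    modules = gpus * modules_per_gpu
    print(f"~{modules:,} transceivers, ~${modules * cost_per_module / 1e6:.0f}M")
    # ~64,000 transceivers, ~$64M, before counting the extra tier a 3-tier fabric adds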

Lesson 05 — TL;DR

  • Two networks: scale-up (NVLink) within rack, scale-out (IB/Ethernet) across racks.
  • InfiniBand at 400/800 Gbps is the dominant scale-out fabric for AI today.
  • Ultra Ethernet is the multi-vendor open challenger — first deployments in 2025–2026.
  • CLOS = non-blocking topology; rail-optimized layouts isolate gradient sync to one rail.
  • Optics are expensive and power-hungry; LPO/CPO are the near-term answer.
