Lesson 05/12 · Intermediate · 16 min read

Network Fabric

Training a frontier model means thousands of GPUs synchronizing gradients many times per second. Standard Ethernet doesn't cut it. This lesson covers InfiniBand vs Ethernet, NVLink scale-up vs scale-out, CLOS topology, and rail-optimized layouts.

1 · Scale-up vs scale-out

Two completely different network problems live in an AI cluster:

Scale-up (within a node)

8–72 GPUs that need to act like one big GPU. Microsecond latency, terabit bandwidth. Solution: NVLink + NVSwitch (or AMD Infinity Fabric, Google ICI).

Scale-out (across nodes)

Connect 100s–10,000s of nodes into one training job. Single-digit microsecond latency, hundreds of Gbps per port. Solution: InfiniBand or RoCE/Ultra Ethernet.
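To get a feel for the gap, here is a rough sketch of one gradient all-reduce, assuming a 70B-parameter model in fp16, NVLink 5 inside the node, and a single 400 Gb/s port per GPU across nodes. All constants are illustrative, not from this lesson:

    # Back-of-the-envelope ring all-reduce time for one full set of fp16 gradients.
    # Every number here is an assumption for illustration, not a measurement.

    def allreduce_seconds(grad_bytes, n_gpus, bytes_per_s):
        # A ring all-reduce pushes ~2*(N-1)/N of the payload through each link.
        traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
        return traffic / bytes_per_s

    grad_bytes = 70e9 * 2          # 70B params, 2 bytes each (fp16)
    nvlink     = 900e9             # ~900 GB/s per direction per GPU (NVLink 5, assumed)
    fabric     = 400e9 / 8         # one 400 Gb/s port = 50 GB/s

    print(f"scale-up,  8 GPUs over NVLink:    {allreduce_seconds(grad_bytes, 8, nvlink):.2f} s")
    print(f"scale-out, 1024 GPUs over fabric: {allreduce_seconds(grad_bytes, 1024, fabric):.2f} s")

Real jobs overlap communication with compute and use hierarchical collectives, so treat this only as a sense of scale: the cross-node hop is the one you have to engineer around.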

2 · Speeds today

Network speeds, per port for the fabric and per GPU (bidirectional) for NVLink:

  • 10 GbE · 10 Gb/s · 2002
  • 100 GbE · 100 Gb/s · 2010
  • 400 GbE / NDR InfiniBand · 400 Gb/s · 2017 (Ethernet) / 2022 (NDR, Mellanox/NVIDIA)
  • 800 GbE / XDR InfiniBand · 800 Gb/s · 2024 deployments; UEC roadmap on the Ethernet side
  • GDR InfiniBand (planned) · 1.6 Tb/s · roadmap
  • NVLink 5 · 1.8 TB/s bidirectional per GPU on Blackwell · 2024

NVLink is scale-up (within the rack); InfiniBand/Ethernet is scale-out (across racks). Ports are the wires; a GPU may attach to the fabric with multiple ports.
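Because the table mixes Gb/s (network convention, bits) and TB/s (GPU convention, bytes), a quick conversion to one unit makes the rows comparable. The snippet below is just that arithmetic, using the figures above:

    # Put the table's numbers in one unit (GB/s, i.e. bytes) before comparing.
    def gbit_to_gbyte(gbit_per_s):
        return gbit_per_s / 8                  # 8 bits per byte

    ndr = gbit_to_gbyte(400)                   # NDR InfiniBand port -> 50 GB/s
    xdr = gbit_to_gbyte(800)                   # XDR InfiniBand port -> 100 GB/s
    nvlink5 = 1800                             # NVLink 5: 1.8 TB/s bidirectional per GPU

    print(f"NDR port {ndr:.0f} GB/s | XDR port {xdr:.0f} GB/s | NVLink 5 {nvlink5} GB/s")
    print(f"NVLink 5 is ~{nvlink5 / xdr:.0f}x one XDR port")   # ~18x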

3 · InfiniBand vs Ultra Ethernet

InfiniBand (now NVIDIA-owned via Mellanox) was purpose-built for HPC and dominates AI training fabrics. It offers lossless flow control, hardware-offloaded RDMA, and tight integration with NCCL (NVIDIA's collective communications library).
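In practice that NCCL integration is steered through environment variables set before the job starts. The variables below are documented NCCL knobs, but the adapter and interface names are placeholders for whatever your cluster exposes; a minimal sketch:

    import os

    # Pick the fabric before NCCL initializes; device names are placeholders.
    os.environ["NCCL_IB_HCA"] = "mlx5"           # use the Mellanox/NVIDIA IB adapters
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # interface for bootstrap / TCP fallback
    os.environ["NCCL_DEBUG"] = "INFO"            # logs which transport NCCL actually chose

    # Force plain TCP instead of RDMA, e.g. to measure what the fabric is buying you:
    # os.environ["NCCL_IB_DISABLE"] = "1"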

Ultra Ethernet is a 2023 industry consortium response (AMD, Broadcom, Cisco, Meta, Microsoft, Oracle, others) trying to bring InfiniBand-class semantics to standard Ethernet — open, multi-vendor, commodity optics. The UEC 1.0 specification was released in 2025, with the first production deployments expected in 2025–2026.

4 · CLOS / fat-tree topology

Diagram: Fat-tree (CLOS) topology with four spine switches above eight leaf switches, each leaf serving a server group; every leaf reaches every other leaf in at most four hops. Two tiers are shown for clarity; real AI fabrics add a third (super-spine) tier for 8k–100k+ GPUs. Bandwidth is non-blocking: any GPU can talk to any other at line rate.

CLOS networks (named after Charles Clos, 1953) provide non-blocking bandwidth: any leaf can talk to any other leaf at full speed, with no path conflicts. The cost is many switch ports — a 32k-GPU cluster needs ~3,000 switches.
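That switch count falls out of standard fat-tree arithmetic. A minimal sketch, assuming radix-64 switches and a classic non-blocking three-tier fat-tree (real builds oversubscribe and trim tiers, so exact counts vary):

    # Classic 3-tier fat-tree built from k-port switches (k even):
    #   hosts    = k^3 / 4
    #   switches = 5 * k^2 / 4   (k^2/4 core + k^2/2 aggregation + k^2/2 edge)
    def fat_tree(k):
        return k**3 // 4, 5 * k**2 // 4

    hosts, switches = fat_tree(64)               # radix-64 switches (assumption)
    print(f"{hosts:,} hosts max, {switches:,} switches")
    # 65,536 hosts max, 5,120 switches -> a ~32k-GPU build lands near ~2,500-3,000 switches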

Rail-optimized topology

For tensor-parallel and pipeline-parallel training, NVIDIA recommends "rail-optimized" topology: every server's nth GPU port goes to the nth rail (a separate physical CLOS network). This keeps gradient sync within a single rail, dramatically reducing tail latency.
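A toy sketch of that wiring rule, with a made-up rail count, just to show that the rail a GPU lands on depends only on its local index:

    # Toy rail-optimized wiring: GPU n on every server plugs into rail n.
    N_RAILS = 8                                   # one rail per local GPU index (assumption)

    def rail_for(server_id, gpu_index):
        # The rail depends only on the GPU's local index, never on the server,
        # so same-index GPUs across the cluster share a rail (one leaf hop apart).
        return gpu_index % N_RAILS

    assert rail_for(server_id=17, gpu_index=3) == rail_for(server_id=512, gpu_index=3)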

5 · Optics — what's actually plugged in

Above 100G, you're using pluggable optical transceivers: QSFP56, QSFP-DD, OSFP. Each is a small fiber-laser module that converts electrical signals to light. At 800G a single QSFP-DD module can cost $1,000+; a 32k-GPU cluster needs tens of thousands.

Cabling options:

  • DAC (passive copper) · ≤3 m · cheap, in-rack
  • AOC (active optical cable) · ≤30 m · pre-terminated fiber
  • Transceiver + fiber · any distance · modular, replaceable
  • LPO (linear pluggable optics) · emerging · lower power, lower latency
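As a rough lower bound on the optics bill, assume the host tier stays on copper DAC (per the list above) and each leaf-to-spine link needs a module at both ends; the per-module price is the figure quoted above, and a third tier roughly doubles the total:

    # Lower-bound optics bill: count only leaf-to-spine links; each optical link
    # needs a pluggable module at both ends. All constants are assumptions.
    gpus            = 32_000
    modules_per_gpu = 2                 # one leaf-side + one spine-side module per uplink
    cost_per_module = 1_000             # USD, 800G-class pluggable (assumption)

    modules = gpus * modules_per_gpu
    print(f"~{modules:,} transceivers, ~${modules * cost_per_module / 1e6:.0f}M")
    # ~64,000 transceivers, ~$64M, before counting the extra tier a 3-tier fabric adds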

Lesson 05 — TL;DR

  • Two networks: scale-up (NVLink) within rack, scale-out (IB/Ethernet) across racks.
  • InfiniBand at 400/800 Gbps is the dominant scale-out fabric for AI today.
  • Ultra Ethernet is the multi-vendor open challenger — first deployments in 2025–2026.
  • CLOS = non-blocking topology; rail-optimized layouts isolate gradient sync to one rail.
  • Optics are expensive and power-hungry; LPO/CPO are the near-term answer.
