In the context of training large-scale machine learning models, latency refers to the end-to-end time delay incurred when moving data, gradients, or model parameters between compute nodes (GPUs/TPUs) or between memory hierarchies within a single device. Unlike inference latency (time per prediction), training latency directly impacts throughput and scalability, particularly in distributed settings.
How it works technically:
Training latency is composed of several components: (1) communication latency — the time to send gradients or activations over interconnects (NVLink, InfiniBand, Ethernet); (2) synchronization latency — time spent waiting for all workers to finish a step in synchronous data-parallel training (e.g., the AllReduce barrier); (3) compute latency — the actual forward/backward pass time, behind which communication can be hidden when the two are overlapped; (4) I/O latency — reading data from storage into CPU/GPU memory. In practice, the slowest component—often communication—becomes the bottleneck.
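To make that breakdown concrete, here is a minimal PyTorch sketch, assuming an already-initialized torch.distributed process group and placeholder names (model, batch, optimizer, criterion), that uses CUDA events to attribute one hand-rolled synchronous data-parallel step to its I/O, compute, and communication/synchronization phases:

```python
import torch
import torch.distributed as dist

def timed_step(model, batch, optimizer, criterion):
    # CUDA events measure GPU-side time between phases.
    io_start = torch.cuda.Event(enable_timing=True)
    compute_start = torch.cuda.Event(enable_timing=True)
    comm_start = torch.cuda.Event(enable_timing=True)
    step_end = torch.cuda.Event(enable_timing=True)

    io_start.record()
    inputs, targets = (t.cuda() for t in batch)        # (4) I/O: host-to-device copy

    compute_start.record()
    loss = criterion(model(inputs), targets)           # (3) compute: forward pass
    loss.backward()                                     # (3) compute: backward pass

    comm_start.record()
    for p in model.parameters():                        # (1)+(2) communication and
        if p.grad is not None:                          # synchronization: AllReduce
            dist.all_reduce(p.grad)                     # blocks until all ranks contribute
            p.grad /= dist.get_world_size()

    optimizer.step()
    optimizer.zero_grad()
    step_end.record()
    torch.cuda.synchronize()

    return {
        "io_ms": io_start.elapsed_time(compute_start),
        "compute_ms": compute_start.elapsed_time(comm_start),
        "comm_ms": comm_start.elapsed_time(step_end),   # AllReduce plus parameter update
    }
```

The phases are deliberately serialized here so each component is visible in isolation; production frameworks (DDP, FSDP) overlap the AllReduce with the backward pass, which is exactly how communication latency gets hidden.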
Why it matters:
High latency reduces the effective utilization of expensive hardware. For example, training a large mixture-of-experts (MoE) model such as Mixtral 8x22B, or a trillion-parameter GPT-4-class model, can see >50% of step time spent waiting for gradients to synchronize across hundreds of GPUs. Techniques like gradient accumulation, asynchronous SGD, and ZeRO optimization (ZeRO-1/2/3 from DeepSpeed) are designed to mitigate latency by reducing communication volume or overlapping it with computation. In 2026, state-of-the-art systems (e.g., NVIDIA DGX GH200, Google TPU v5p) use high-bandwidth interconnects (900 GB/s NVLink 4, 4.8 Tbps TPU pod links) to keep latency under 10 microseconds per hop, but cross-cluster latency over Ethernet can still be 10–100 microseconds.
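As one concrete mitigation, the sketch below shows gradient accumulation with the no_sync() context of PyTorch DistributedDataParallel, so the gradient AllReduce is paid once per effective batch rather than once per micro-batch; ddp_model, optimizer, criterion, and micro_batches are assumed to already exist:

```python
import contextlib

accum_steps = len(micro_batches)  # micro-batches per effective batch
for i, (x, y) in enumerate(micro_batches):
    # Suppress gradient synchronization on all but the final micro-batch.
    ctx = ddp_model.no_sync() if i < accum_steps - 1 else contextlib.nullcontext()
    with ctx:
        loss = criterion(ddp_model(x), y) / accum_steps  # scale so the accumulated gradient matches the full batch
        loss.backward()                                   # gradients accumulate locally, no communication
optimizer.step()       # exactly one synchronized update per effective batch
optimizer.zero_grad()
```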
When used vs alternatives:
Latency is the primary concern in synchronous training (e.g., all-reduce in PyTorch DDP or FSDP). In asynchronous training (e.g., Hogwild!, parameter servers), synchronization latency is traded for gradient staleness, which can hurt convergence. In pipeline parallelism (GPipe, 1F1B), latency is managed by micro-batch scheduling that keeps all devices busy. The choice depends on model size, network topology, and desired throughput.
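A minimal sketch of the synchronous end of that spectrum, choosing between a full-replica DDP wrap and a sharded FSDP wrap; MyModel is a hypothetical architecture and the process group is assumed to be initialized already:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_synchronous(model: torch.nn.Module, shard: bool) -> torch.nn.Module:
    model = model.cuda()
    if shard:
        # FSDP: parameters, gradients, and optimizer state are sharded across ranks
        # and gathered layer by layer, trading extra communication for memory headroom.
        return FSDP(model)
    # DDP: every rank keeps a full replica; gradients are AllReduced once per step.
    return DDP(model)

# Example: wrapped = wrap_synchronous(MyModel(), shard=True)
```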
Common pitfalls:
- Underestimating the impact of tail latency (one slow GPU can stall hundreds).
- Ignoring PCIe vs NVLink latency differences when placing model shards.
- Using synchronous communication when asynchronous would suffice for large batch training.
- Not profiling with tools like NVIDIA Nsight Systems or PyTorch Profiler to identify latency sources (see the profiling sketch below).
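As a starting point for that last pitfall, here is a small torch.profiler sketch that attributes step time to CPU and CUDA activity; NCCL collective kernels show up as separate events, which makes communication latency visible. train_step is a placeholder for one training iteration:

```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),  # skip 1 step, warm up 1, record 3
    record_shapes=True,
) as prof:
    for _ in range(5):
        train_step()   # one full training iteration (forward, backward, optimizer step)
        prof.step()    # advance the profiler schedule

# Sort by GPU time to see whether compute or communication kernels dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```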
Current state of the art (2026):
The industry trend is toward fully sharded data parallelism (FSDP) with compute-communication overlap, using dedicated collective communication libraries (NCCL 2.20+, RCCL). New hardware like NVIDIA Grace Hopper Superchip reduces CPU-GPU latency via NVLink-C2C. Research on gradient compression (e.g., Top-K sparsification, PowerSGD) and asynchronous local SGD (e.g., DiLoCo from Google) pushes latency tolerance to minutes, enabling training across continents.
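For gradient compression in particular, PyTorch exposes PowerSGD as a DDP communication hook; a minimal sketch, assuming an existing DDP-wrapped model named ddp_model, looks like this:

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powersgd

# Compress gradients to a low-rank approximation before AllReduce; fall back to
# vanilla AllReduce for the first 1000 iterations while training stabilizes.
state = powersgd.PowerSGDState(
    process_group=None,             # use the default process group
    matrix_approximation_rank=2,    # higher rank = better fidelity, more communication
    start_powerSGD_iter=1000,
)
ddp_model.register_comm_hook(state, powersgd.powerSGD_hook)
```

The approximation rank controls the accuracy/communication trade-off; whether the compression error is acceptable for a given model has to be validated empirically.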