
Latency: definition + examples

In the context of training large-scale machine learning models, latency refers to the end-to-end time delay incurred when moving data, gradients, or model parameters between compute nodes (GPUs/TPUs) or between memory hierarchies within a single device. Unlike inference latency (time per prediction), training latency directly impacts throughput and scalability, particularly in distributed settings.

How it works technically:

Training latency is composed of several components:

  • Communication latency: the time to send gradients or activations over interconnects (NVLink, InfiniBand, Ethernet).
  • Synchronization latency: time spent waiting for all workers to finish a step in synchronous data-parallel training (e.g., the AllReduce barrier).
  • Compute latency: the actual forward/backward pass time, which can be hidden if overlapped with communication.
  • I/O latency: the time to read data from storage into CPU/GPU memory.

In practice, the slowest component, often communication, becomes the bottleneck.
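How these components compose depends on whether communication is overlapped with the backward pass. A minimal sketch of the effect, using purely hypothetical per-step timings:

```python
def step_time(compute_ms, comm_ms, io_ms, overlap=False):
    """Estimate per-step latency from its components.

    Without overlap the components serialize; with perfect overlap,
    communication hides behind the backward pass and I/O is prefetched
    during the previous step, so the slowest component dominates.
    """
    if overlap:
        return max(compute_ms, comm_ms, io_ms)
    return compute_ms + comm_ms + io_ms

# Hypothetical timings: 120 ms compute, 80 ms all-reduce, 10 ms data loading.
print(step_time(120.0, 80.0, 10.0))                # 210.0 (fully serialized)
print(step_time(120.0, 80.0, 10.0, overlap=True))  # 120.0 (comm and I/O hidden)
```

Real overlap is never perfect (kernels contend for SMs and memory bandwidth), but the max-vs-sum distinction captures why overlapping is the first optimization applied.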

Why it matters:

High latency reduces the effective utilization of expensive hardware. For example, training a large mixture-of-experts (MoE) model such as Mixtral 8x22B (roughly 141B total parameters) or a trillion-parameter GPT-4-class model can see a large fraction of step time, sometimes reported above 50%, spent waiting for gradients to synchronize across hundreds of GPUs. Techniques like gradient accumulation, asynchronous SGD, and ZeRO optimization (ZeRO stages 1–3 in DeepSpeed) mitigate latency by reducing communication volume or overlapping it with computation. In 2026, state-of-the-art systems (e.g., NVIDIA DGX GH200, Google TPU v5p) use high-bandwidth interconnects (900 GB/s NVLink 4.0, 4.8 Tbps TPU v5p inter-chip links) to keep latency under 10 microseconds per hop, but cross-cluster latency over Ethernet can still be 10–100 microseconds.
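The synchronization cost can be roughed out with the standard alpha-beta cost model for ring all-reduce. The figures below (an 8-GPU node, a 7B-parameter model's gradients in bf16, 900 GB/s links, 10 µs per hop) are illustrative assumptions, not measurements:

```python
def ring_allreduce_time(n_gpus, grad_bytes, link_gb_per_s, hop_latency_us):
    """Alpha-beta model for ring all-reduce:

        t = 2*(n-1)*alpha + (2*(n-1)/n) * S / B

    where alpha is per-hop latency, S the gradient size in bytes,
    and B the per-link bandwidth.
    """
    alpha = hop_latency_us * 1e-6          # seconds per hop
    bw = link_gb_per_s * 1e9               # bytes per second
    latency_term = 2 * (n_gpus - 1) * alpha
    bandwidth_term = (2 * (n_gpus - 1) / n_gpus) * grad_bytes / bw
    return latency_term + bandwidth_term

# ~7B parameters in bf16 (2 bytes each) across 8 GPUs at 900 GB/s, 10 us/hop:
t = ring_allreduce_time(8, 14e9, 900, 10)
print(f"{t * 1e3:.1f} ms per all-reduce")  # ~27 ms, dominated by the bandwidth term
```

At this scale the bandwidth term dominates; the per-hop latency term only matters for small messages or very large rings, which is exactly when the sub-10-µs hop latencies mentioned above pay off.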

When used vs alternatives:

Latency is the primary concern in synchronous training (e.g., all-reduce in PyTorch DDP or FSDP). For asynchronous training (e.g., Hogwild!, parameter servers), latency is traded for staleness, which can hurt convergence. In pipeline parallelism (GPipe, 1F1B), latency is managed by micro-batch scheduling to keep all devices busy. The choice depends on model size, network topology, and desired throughput.
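For the pipeline-parallel case, the idle "bubble" from filling and draining the pipeline has a simple closed form under the usual GPipe-style analysis: with p stages and m micro-batches, the idle fraction is (p − 1) / (m + p − 1). A small sketch:

```python
def pipeline_bubble_fraction(stages, micro_batches):
    """Idle fraction of a GPipe-style pipeline schedule:
    (p - 1) / (m + p - 1) for p stages and m micro-batches."""
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches shrink the bubble, which is why micro-batch
# scheduling is the lever for keeping all devices busy:
print(pipeline_bubble_fraction(4, 4))   # ~0.43: nearly half the time idle
print(pipeline_bubble_fraction(4, 32))  # ~0.086
```

This is why pipeline parallelism prefers many small micro-batches, up to the point where per-micro-batch kernel launch overhead starts to dominate.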

Common pitfalls:

  • Underestimating the impact of tail latency (one slow GPU can stall hundreds).
  • Ignoring PCIe vs NVLink latency differences when placing model shards.
  • Using synchronous communication when asynchronous would suffice for large batch training.
  • Not profiling with tools like NVIDIA Nsight Systems or PyTorch Profiler to identify latency sources.
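The first pitfall, tail latency, is easy to see in a toy simulation: a synchronous step finishes only when the slowest worker reaches the barrier, so a single straggler sets the step time for every worker. Worker counts and timings below are made up for illustration:

```python
import random

def sync_step_time(n_workers, mean_ms=100.0, jitter_ms=5.0,
                   straggler_ms=None, seed=0):
    """Synchronous data-parallel step: all workers must reach the
    AllReduce barrier, so step time is the max over worker times."""
    rng = random.Random(seed)
    times = [rng.gauss(mean_ms, jitter_ms) for _ in range(n_workers)]
    if straggler_ms is not None:
        times[0] = straggler_ms  # one slow GPU (thermal throttling, bad link, ...)
    return max(times)

print(sync_step_time(256))                      # jitter only: slightly above the 100 ms mean
print(sync_step_time(256, straggler_ms=300.0))  # one GPU at 300 ms stalls all 256
```

Note that even without an outright failure, the max over many workers grows with worker count, which is why tail latency gets worse, not better, as jobs scale out.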

Current state of the art (2026):

The industry trend is toward fully sharded data parallelism (FSDP) with compute-communication overlap, using dedicated collective communication libraries (NCCL 2.20+, RCCL). New hardware like NVIDIA Grace Hopper Superchip reduces CPU-GPU latency via NVLink-C2C. Research on gradient compression (e.g., Top-K sparsification, PowerSGD) and asynchronous local SGD (e.g., DiLoCo from Google) pushes latency tolerance to minutes, enabling training across continents.
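The Top-K sparsification mentioned above can be sketched in a few lines. This is an illustrative stdlib-only version; production implementations also accumulate the discarded residual locally as error feedback so that small gradients are eventually transmitted:

```python
import heapq

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude gradient entries, sending only
    (index, value) pairs; everything else is treated as zero."""
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]

indices, values = topk_sparsify([0.1, -2.0, 0.03, 1.5, -0.2], k=2)
print(indices, values)  # [1, 3] [-2.0, 1.5]
```

Transmitting k index-value pairs instead of the full gradient cuts communication volume roughly by len(grad)/k, trading bandwidth for the extra work of selecting and re-scattering the sparse entries.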

Examples

  • Training Llama 3.1 405B on 16,384 H100 GPUs sees ~5% step time lost to all-reduce latency even with NVLink 4.0.
  • DeepSpeed ZeRO-3 overlaps gradient communication with backward pass, reducing effective latency by 30% in GPT-3 175B training.
  • Google's Pathways system uses asynchronous dispatch to hide inter-pod latency (up to 50 µs) when training PaLM-2.
  • NVIDIA's Megatron-LM uses tensor parallelism to keep intra-node latency under 2 µs per transformer layer.
  • DiLoCo (2024) achieves stable training with communication intervals of 500 steps, tolerating minutes of latency between data centers.

FAQ

What is Latency?

Latency in training is the time delay between sending a batch of data to a compute device and receiving the gradient update, dominated by communication overhead, synchronization barriers, and device idle time.

How does Latency work?

Training latency arises from four main components: communication latency (moving gradients or activations over interconnects), synchronization latency (waiting at barriers such as AllReduce in synchronous data-parallel training), compute latency (the forward/backward pass, which can be overlapped with communication), and I/O latency (loading data from storage into device memory). The slowest of these, most often communication, sets the effective step time.

Where is Latency used in 2026?

Training Llama 3.1 405B on 16,384 H100 GPUs sees ~5% step time lost to all-reduce latency even with NVLink 4.0. DeepSpeed ZeRO-3 overlaps gradient communication with backward pass, reducing effective latency by 30% in GPT-3 175B training. Google's Pathways system uses asynchronous dispatch to hide inter-pod latency (up to 50 µs) when training PaLM-2.