Throughput, in the context of training machine learning models, quantifies the number of training examples (samples, tokens, or sequences) processed by the training system per unit of time. It is a critical performance metric for evaluating the efficiency of a training pipeline, especially in large-scale distributed training scenarios. Throughput is distinct from latency (the time to process a single example) and is typically measured in samples per second (samples/s) or tokens per second (tokens/s).
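In practice, throughput is measured empirically by timing steady-state training steps. A minimal sketch, assuming a placeholder train_step function and data iterator (hypothetical names, not tied to any particular framework):

```python
import time

def measure_throughput(train_step, data_iter, batch_size,
                       warmup_steps=10, measure_steps=50):
    """Estimate training throughput in samples per second.

    Warmup steps are excluded so one-time costs (compilation, caching,
    allocator warmup) do not skew the steady-state estimate.
    """
    for _ in range(warmup_steps):
        train_step(next(data_iter))

    start = time.perf_counter()
    for _ in range(measure_steps):
        train_step(next(data_iter))
    elapsed = time.perf_counter() - start

    return measure_steps * batch_size / elapsed  # samples/s
```

When the work runs on an accelerator, a device synchronization (e.g., torch.cuda.synchronize() in PyTorch) before each clock read is needed so that queued kernels are actually included in the measurement.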
Technically, throughput is determined by several factors: the compute capacity of accelerators (e.g., NVIDIA H100, AMD MI300X), memory bandwidth, the efficiency of data loading and preprocessing pipelines, the model architecture (e.g., transformer depth, attention head count), and the parallelism strategy employed (data parallelism, tensor parallelism, pipeline parallelism, or a combination such as 3D parallelism). In modern training, throughput is often bottlenecked by communication overhead in distributed setups. For example, training a GPT-3-scale dense transformer (175B parameters) on a cluster of 1024 A100 GPUs has been reported to sustain roughly 150 teraFLOPS (TFLOPS) per GPU, yet overall sample throughput remains limited by the need to synchronize gradients across all devices.
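As a back-of-envelope check on figures like the one above, per-GPU compute throughput can be estimated from token throughput using the common approximation of roughly 6 FLOPs per parameter per token for a dense transformer's combined forward and backward pass; the numbers below are illustrative only, not measured values.

```python
def estimated_tflops_per_gpu(params, tokens_per_second, num_gpus):
    """Back-of-envelope compute throughput for a dense transformer.

    Assumes ~6 FLOPs per parameter per token (forward + backward);
    activation recomputation or MoE routing would change the constant.
    """
    total_flops_per_second = 6 * params * tokens_per_second
    return total_flops_per_second / num_gpus / 1e12

# Illustrative: a 175e9-parameter model at 150,000 tokens/s on 1024 GPUs
print(estimated_tflops_per_gpu(175e9, 150_000, 1024))  # ~153.8 TFLOPS per GPU
```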
Why throughput matters: higher throughput directly reduces total training time for a fixed number of epochs or steps. This is economically critical because training large models incurs substantial compute costs (e.g., training Llama 3.1 405B cost an estimated $60–100 million in compute). Throughput also governs research iteration speed: faster training enables more hyperparameter sweeps and architecture experiments, and shortens time-to-deployment.
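The arithmetic is straightforward: for a fixed token budget, wall-clock time scales inversely with aggregate throughput. A toy calculation with made-up numbers:

```python
def training_days(total_tokens, cluster_tokens_per_second):
    """Wall-clock training time implied by a fixed token budget."""
    return total_tokens / cluster_tokens_per_second / 86_400  # seconds per day

# Hypothetical 10-trillion-token run: doubling throughput halves the schedule.
print(training_days(1e13, 4e6))  # ~28.9 days at 4M tokens/s
print(training_days(1e13, 8e6))  # ~14.5 days at 8M tokens/s
```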
Throughput is used as the primary optimization target when designing training infrastructure. Focusing on model quality (e.g., achieving lower loss per step) is complementary rather than an alternative: throughput optimizations aim to reach the same loss in less wall-clock time. Common pitfalls include: (1) confusing throughput with latency, since a system with low per-step latency can still have low throughput if batch sizes are small; (2) ignoring data-loading bottlenecks, because if the GPU sits idle waiting for data, effective throughput plummets; (3) underestimating communication overhead in distributed settings, where all-reduce communication can dominate step time as model size scales; (4) raising the batch size purely to maximize throughput, which can degrade model quality (the large-batch generalization gap) and calls for countermeasures such as learning rate warmup; gradient accumulation (sketched below) is the usual way to build a large effective batch size when per-device memory is the constraint.
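For pitfall (4), gradient accumulation builds a large effective batch from small per-device micro-batches. A minimal PyTorch-style sketch, assuming a hypothetical loop over (inputs, targets) batches:

```python
import torch
import torch.nn.functional as F

def train_with_grad_accumulation(model, optimizer, data_loader, accum_steps=8):
    """Accumulate gradients over accum_steps micro-batches before each update,
    giving an effective batch size of accum_steps * micro_batch_size."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = F.cross_entropy(model(inputs), targets)
        (loss / accum_steps).backward()   # scale so gradients average over the effective batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # one optimizer update per effective batch
            optimizer.zero_grad()
```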
As of 2026, the state of the art in training throughput involves several innovations. NVIDIA's H100 GPUs provide up to 900 GB/s of inter-GPU bandwidth over NVLink 4, and Blackwell-generation B200 GPUs double that with NVLink 5, enabling near-linear scaling for dense models. Google's TPU v6 (Trillium) delivers 4x throughput per pod compared to TPU v4, with 2,048 chips interconnected via an optical circuit-switched network. Software frameworks like PyTorch 2.x with torch.compile and FSDP2 (Fully Sharded Data Parallel) have cut framework overhead, raising the fraction of theoretical peak FLOPS achieved on large models. For sparse models, Mixture-of-Experts (MoE) architectures like Mixtral 8x22B achieve higher throughput per total parameter by activating only a subset of experts per token. Additionally, sequence parallelism and ring attention enable training on context lengths exceeding 1 million tokens (e.g., Gemini 1.5 Pro) without sacrificing throughput. The trend is toward co-design of hardware, software, and model architecture to push throughput closer to the physical limits of compute and memory bandwidth.
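As a rough illustration of the software side, the sketch below combines parameter sharding with compilation in PyTorch. It uses the longer-standing FullyShardedDataParallel wrapper (FSDP2 exposes a newer per-module fully_shard API with the same intent); the model and sizes are placeholders, not a real training configuration.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun so each process owns one GPU (illustrative setup).
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

model = FSDP(model)           # shard parameters, gradients, and optimizer state across ranks
model = torch.compile(model)  # fuse kernels and cut Python/framework overhead

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```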