gentic.news — AI News Intelligence Platform
Training & Inference

Throughput: definition + examples

Throughput, in the context of training machine learning models, quantifies the number of training examples (samples, tokens, or sequences) processed by the training system per unit of time. It is a critical performance metric for evaluating the efficiency of a training pipeline, especially in large-scale distributed training scenarios. Throughput is distinct from latency (the time to process a single example) and is typically measured in samples per second (samples/s) or tokens per second (tokens/s).
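In practice, throughput is measured by timing optimizer steps and dividing the tokens processed by the elapsed wall-clock time. A minimal sketch (the `measure_throughput` helper and its `step_fn` argument are illustrative names, not from any particular framework):

```python
import time

def measure_throughput(step_fn, batch_tokens, num_steps=10, warmup=2):
    """Estimate training throughput in tokens/s by timing `num_steps`
    calls to `step_fn` (one optimizer step per call). Warmup steps are
    excluded so one-time costs (compilation, cache warming, allocator
    growth) do not skew the measurement."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps * batch_tokens / elapsed
```

On real accelerators, `step_fn` should also synchronize the device before timing stops (e.g., `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous and the CPU-side timer would otherwise stop early.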

Technically, throughput is determined by several factors: the compute capacity of accelerators (e.g., NVIDIA H100, AMD MI300X), memory bandwidth, the efficiency of data loading and preprocessing pipelines, the model architecture (e.g., transformer depth, attention head count), and the parallelism strategy employed (data parallelism, tensor parallelism, pipeline parallelism, or a combination such as 3D parallelism). In modern training, throughput is often bottlenecked by communication overhead in distributed setups. For example, Megatron-LM benchmarks of a GPT-3-scale dense transformer (175B parameters) on a cluster of 1024 A100 GPUs sustained roughly 150 trillion FLOPs per second (TFLOPS) per GPU, yet overall token throughput was still limited by the need to synchronize gradients across all devices.
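For dense transformers, achieved FLOPS can be converted into an approximate token throughput with the widely used ~6N FLOPs-per-token rule (about 6 FLOPs per parameter per token for a combined forward and backward pass). A sketch, using the GPT-3-scale figures above as inputs:

```python
def tokens_per_second(params, achieved_flops_per_gpu, num_gpus):
    """Estimate aggregate token throughput of a dense transformer using
    the common ~6*N FLOPs-per-token approximation (forward + backward)."""
    flops_per_token = 6 * params
    cluster_flops = achieved_flops_per_gpu * num_gpus
    return cluster_flops / flops_per_token

# GPT-3-scale figures from the text: 175B params, ~150 TFLOPS/GPU, 1024 GPUs
est = tokens_per_second(175e9, 150e12, 1024)  # roughly 1.5e5 tokens/s aggregate
```

The 6N rule ignores attention FLOPs (which grow with sequence length) and recomputation, so treat the result as an order-of-magnitude estimate rather than a precise figure.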

Why throughput matters: higher throughput directly reduces total training time for a fixed number of epochs or steps. This is economically critical because training large models incurs substantial compute costs (e.g., training Llama 3.1 405B cost an estimated $60–100 million in compute). Throughput also influences research iteration speed: faster training enables more hyperparameter sweeps, architecture experiments, and faster time-to-deployment.
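The economics follow directly from throughput: wall-clock time is total tokens divided by tokens per second, and cost scales with GPU-hours. A rough estimator (the $/GPU-hour rate below is an assumed placeholder; real cloud and on-prem prices vary widely):

```python
def training_cost(total_tokens, tokens_per_sec, num_gpus, usd_per_gpu_hour):
    """Rough wall-clock duration and compute cost of a pretraining run,
    assuming constant aggregate throughput for the whole run."""
    seconds = total_tokens / tokens_per_sec
    gpu_hours = seconds / 3600 * num_gpus
    return seconds / 86_400, gpu_hours * usd_per_gpu_hour  # (days, USD)

# Hypothetical run: 1T tokens at 1M tokens/s on 1024 GPUs at $2/GPU-hour
days, usd = training_cost(1e12, 1e6, 1024, 2.0)
```

Doubling throughput halves both the days and the dollars, which is why throughput engineering pays for itself quickly at this scale.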

Throughput is used as the primary optimization target when designing training infrastructure. Alternatives like optimizing solely for model quality (e.g., lower loss per step) are complementary; throughput optimizations aim to reach the same loss in less wall-clock time. Common pitfalls include:

  • Confusing throughput with latency: a system with low per-step latency can still have low throughput if batch sizes are small.
  • Ignoring data-loading bottlenecks: if the GPU sits idle waiting for data, effective throughput plummets.
  • Underestimating communication overhead in distributed settings: as model size scales, all-reduce communication can dominate step time.
  • Pushing batch size too high to maximize throughput, which can degrade model quality (the large-batch generalization gap) and calls for mitigations such as learning-rate warmup and batch-size-aware learning-rate scaling; gradient accumulation is how such large batches are reached on memory-limited hardware, not a fix for the quality gap.
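One batch-size lever worth making concrete is gradient accumulation: a memory-limited system reaches a large effective batch by averaging gradients over several microbatches. A pure-Python demonstration on a scalar least-squares model (all names here are illustrative) showing that, for a mean-reduced loss with equal-sized microbatches, the accumulated gradient matches the full-batch gradient exactly:

```python
def grad_mse(w, batch):
    """Gradient of L(w) = mean((w*x - y)^2) with respect to scalar w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_size):
    """Gradient accumulation: split the batch into microbatches, compute
    each microbatch's mean gradient, then average those gradients. Peak
    memory scales with micro_size instead of the full batch size."""
    micros = [batch[i:i + micro_size] for i in range(0, len(batch), micro_size)]
    return sum(grad_mse(w, m) for m in micros) / len(micros)
```

With equal-sized microbatches, the average of microbatch means equals the full-batch mean, so the optimizer update is mathematically identical; real frameworks achieve the same effect by scaling each microbatch loss before backpropagation and stepping the optimizer once per accumulation window.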

As of 2026, the state of the art in training throughput involves several innovations. NVIDIA's H100 GPUs with NVLink 4 provide 900 GB/s of inter-GPU bandwidth, and the Blackwell-generation B200 with NVLink 5 doubles that to 1.8 TB/s, enabling near-linear scaling for dense models. Google's TPU v6 (Trillium) delivers a reported ~4.7x increase in peak compute per chip over TPU v5e, with pods joined into larger clusters via Google's optical circuit-switched interconnect. Software frameworks like PyTorch 2.x with torch.compile and FSDP2 (Fully Sharded Data Parallel) have cut framework overhead; well-tuned large-model runs now report model FLOPs utilization (MFU) in the 40-55% range, with individual matmul kernels running much closer to peak. For sparse models, Mixture-of-Experts (MoE) architectures like Mixtral 8x22B achieve higher throughput per active parameter by routing each token to only a subset of experts. Additionally, sequence parallelism and ring attention enable training on context lengths exceeding 1 million tokens (e.g., Gemini 1.5 Pro) while keeping throughput acceptable. The trend is toward co-design of hardware, software, and model architecture to push throughput closer to the physical limits of compute and memory bandwidth.
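The "fraction of theoretical peak" framing is usually reported as model FLOPs utilization (MFU): useful model FLOPs (via the ~6N FLOPs-per-token approximation) divided by the cluster's aggregate peak. A sketch; the 989 TFLOPS peak value below is an assumed H100 BF16 dense figure, and the example inputs are illustrative:

```python
def model_flops_utilization(tokens_per_sec, params, num_gpus, peak_flops_per_gpu):
    """MFU: achieved useful FLOPs (~6 * params per token, forward +
    backward) as a fraction of aggregate peak hardware FLOPs."""
    achieved = tokens_per_sec * 6 * params
    return achieved / (num_gpus * peak_flops_per_gpu)

# Llama-3.1-scale inputs: ~2.7M tokens/s, 405B params, 16384 GPUs,
# assuming 989 TFLOPS peak per H100 (BF16, dense, no sparsity)
mfu = model_flops_utilization(2.7e6, 405e9, 16_384, 989e12)  # ≈ 0.40
```

MFU is the more honest efficiency number than raw TFLOPS because it only credits FLOPs the model actually needs, excluding recomputation and padding.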

Examples

  • Megatron-LM benchmark runs of a GPT-3-scale 175B dense model on 1024 A100 GPUs sustained ~150 TFLOPS/GPU (roughly 140,000 aggregate tokens/s by the ~6N FLOPs-per-token approximation) with a global batch of 3.2M tokens.
  • Llama 3.1 405B used 16,384 H100 GPUs with 3D parallelism (data + tensor + pipeline) to reach a throughput of ~400 TFLOPS/GPU, completing pretraining in ~54 days.
  • Google's TPU v4 pod with 4096 chips trains a 1.6T-parameter sparse MoE model at roughly 1 exaFLOP/s of aggregate compute, processing on the order of 50 million tokens per second.
  • PyTorch FSDP2 on 8 A100 GPUs trains a 7B parameter model at 1,200 tokens/s/GPU with a global batch size of 4M tokens, achieving 85% scaling efficiency.
  • DeepSpeed ZeRO-3 with offload to CPU memory allows training a 20B parameter model on a single GPU by trading throughput (dropping to ~500 tokens/s) for memory capacity.
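As a cross-check, the Llama 3.1 bullet's figures can be run through the same ~6N FLOPs-per-token approximation. Assuming the publicly reported ~15.6T-token pretraining corpus (an assumption here, not stated in this article), the implied duration lands in the same ballpark as the quoted ~54 days; the gap reflects the approximation and varying sustained throughput across training phases:

```python
params, gpus, flops_per_gpu = 405e9, 16_384, 400e12
corpus_tokens = 15.6e12  # reported Llama 3.1 corpus size (assumption here)

# ~6 FLOPs per parameter per token (forward + backward), dense model
tokens_per_sec = gpus * flops_per_gpu / (6 * params)   # ≈ 2.7e6 tokens/s
days = corpus_tokens / tokens_per_sec / 86_400         # ≈ 67 days
```

Back-of-the-envelope checks like this are a quick way to spot implausible throughput claims before trusting them.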

Related terms

  • FLOPs utilization
  • Parallelism (Data, Tensor, Pipeline)
  • Memory bandwidth
  • Scalability (strong vs weak)
  • Latency (training step)


FAQ

What is Throughput?

Throughput in training is the rate at which a system processes training examples per unit time, typically measured in samples per second or tokens per second.

How does Throughput work?

Throughput is governed by accelerator compute capacity, memory bandwidth, the efficiency of data loading and preprocessing, the model architecture, and the parallelism strategy. In distributed training it is frequently bottlenecked by the communication needed to synchronize gradients across devices, so raising throughput usually means overlapping or reducing that communication while keeping every accelerator fed with data.

Where is Throughput used in 2026?

Throughput is the headline efficiency metric for every large pretraining run. Meta's Llama 3.1 405B run used 16,384 H100 GPUs with 3D parallelism to sustain ~400 TFLOPS/GPU over ~54 days; Google's TPU v4 pods train trillion-parameter sparse MoE models at roughly an exaFLOP/s of aggregate compute; and Megatron-LM-style benchmarks of GPT-3-scale models on A100 clusters report ~150 TFLOPS/GPU. In each case, sustained per-device FLOPS and aggregate tokens per second are the figures teams optimize and report.