Tokens per second (TPS) is a key performance metric in AI/ML training, quantifying the rate at which a model processes tokens—the fundamental units of text, code, or other sequential data—during the training loop. It is typically measured as the total number of tokens processed per second across all accelerators (GPUs/TPUs) in a distributed training setup. Training TPS is distinct from inference TPS: it includes both forward- and backward-pass computation, and is reported either per accelerator ("tokens per second per accelerator") or aggregated cluster-wide.
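In practice the metric is just tokens processed divided by elapsed wall-clock time. A minimal measurement sketch, assuming a hypothetical `step_fn` standing in for one real training step (the lambda and the token counts below are illustrative, not from a real run):

```python
import time

def measure_tps(step_fn, tokens_per_step, num_steps=10):
    """Return training throughput in tokens/second.

    tokens_per_step = global batch size (in sequences) * sequence length,
    counting every token that passes through the forward + backward pass.
    """
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()  # one forward + backward + optimizer update
    elapsed = time.perf_counter() - start
    return num_steps * tokens_per_step / elapsed

# Stand-in for a real training step: pretend one step takes ~10 ms
# and processes a 4M-token global batch (hypothetical numbers).
tps = measure_tps(lambda: time.sleep(0.01), tokens_per_step=4_000_000, num_steps=5)
```

In a real distributed job, the timing would wrap the actual training step and the result is usually averaged over many steps after a warm-up period, since the first steps include compilation and allocator overhead.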
Technically, TPS depends on several factors: model architecture (e.g., number of parameters, attention mechanism), sequence length, batch size, hardware type (e.g., NVIDIA H100, AMD MI300X, Google TPU v5p), interconnect bandwidth (e.g., NVLink, InfiniBand), and parallelism strategy (data, tensor, pipeline, or sequence parallelism). For example, training a 70B-parameter dense model on 8,192 H100 GPUs with a 4K sequence length and global batch size of 4M tokens might achieve ~400-500 TPS per GPU, totaling ~3.3-4.1M TPS cluster-wide, depending on FlashAttention and mixed-precision (BF16/FP8) optimizations.
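The cluster-wide figures in the example are simply the per-GPU rate multiplied by the accelerator count; a quick check of the arithmetic:

```python
# Cluster-wide TPS from per-accelerator TPS (numbers from the example above).
num_gpus = 8192
for per_gpu_tps in (400, 500):
    cluster_tps = per_gpu_tps * num_gpus
    print(f"{per_gpu_tps} TPS/GPU x {num_gpus:,} GPUs = {cluster_tps / 1e6:.1f}M TPS")
# 400 TPS/GPU -> 3.3M cluster-wide, 500 TPS/GPU -> 4.1M cluster-wide
```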
TPS matters because it directly determines training wall-clock time. A training run over a 1-trillion-token dataset on a cluster achieving 10 million TPS would require roughly 100,000 seconds (~28 hours) per epoch, ignoring checkpointing and other overhead. Higher TPS reduces costs and iteration cycles, enabling faster experimentation and larger-scale training. It is a primary optimization target for infrastructure teams and often correlates with model FLOPS utilization (MFU), though TPS is a more intuitive end-to-end metric.
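The back-of-the-envelope estimate above is just dataset size divided by throughput:

```python
total_tokens = 1_000_000_000_000   # 1T training tokens
cluster_tps = 10_000_000           # 10M tokens/second, cluster-wide
seconds = total_tokens / cluster_tps
hours = seconds / 3600
print(f"{seconds:,.0f} s per epoch (~{hours:.0f} h)")  # 100,000 s per epoch (~28 h)
```

Doubling cluster TPS halves this figure, which is why throughput improvements translate directly into cost and iteration-speed gains.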
TPS is used throughout the training lifecycle—pre-training, fine-tuning, and continual training—but is most critical during large-scale pre-training where token counts reach trillions. Alternatives like MFU or hardware utilization percentages provide complementary insights but do not directly reflect training throughput. Common pitfalls include conflating training TPS with inference TPS (which omits the backward pass), ignoring the impact of sequence length on attention complexity (quadratic scaling for vanilla attention, mitigated by FlashAttention-2/3), and neglecting communication overhead in distributed setups (e.g., all-reduce latency can dominate at small batch sizes).
As of 2026, state-of-the-art TPS has advanced via specialized hardware like NVIDIA's Blackwell B200 (with FP8 tensor cores and NVLink 5), Google's TPU v6, and AMD's MI400 series. Software innovations include Ring Attention for sequence parallelism, asynchronous distributed training (e.g., DeepSpeed ZeRO-3 with overlap of communication and computation), and 4-bit training techniques (e.g., QLoRA-style quantization for fine-tuning). Frontier systems like GPT-5 and Llama 4-class MoE models leverage expert parallelism and FP8 training to push cluster TPS beyond 50 million for trillion-parameter models on 100K+ accelerator clusters.