
Tokens per Second: definition + examples

Tokens per second (TPS) is a key performance metric in AI/ML training, quantifying the rate at which a model processes tokens (the fundamental units of text, code, or other sequential data) during the training loop. It is typically measured as the total number of tokens in each training batch processed per second across all accelerators (GPUs/TPUs) in a distributed setup. Training TPS is distinct from inference TPS: it includes both the forward and backward passes, and is reported either per accelerator ("tokens per second per GPU") or aggregated cluster-wide.
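
For concreteness, a minimal sketch of how training TPS is commonly measured; train_step is a hypothetical callable standing in for one optimizer step, and the warmup logic simply excludes startup overhead from the timing window:

```python
import time

def measure_training_tps(train_step, batches, tokens_per_batch, warmup=5):
    """Measure training tokens/sec over a run of steps.

    train_step: callable running one forward + backward + optimizer step
    batches: iterable of training batches
    tokens_per_batch: tokens processed per step (batch_size * seq_len)
    warmup: initial steps excluded so compilation/caching doesn't skew timing
    """
    counted_tokens = 0
    start = None
    for i, batch in enumerate(batches):
        train_step(batch)  # on GPUs, synchronize here before reading the clock
        if i == warmup:
            start = time.perf_counter()   # timing begins after warmup steps
        elif i > warmup:
            counted_tokens += tokens_per_batch
    elapsed = time.perf_counter() - start
    return counted_tokens / elapsed       # tokens per second
```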

Technically, TPS depends on several factors: model architecture (e.g., number of parameters, attention mechanism), sequence length, batch size, hardware type (e.g., NVIDIA H100, AMD MI300X, Google TPU v5p), interconnect bandwidth (e.g., NVLink, InfiniBand), and parallelism strategy (data, tensor, pipeline, or sequence parallelism). For example, training a 70B-parameter dense model on 8,192 H100 GPUs with a 4K sequence length and a global batch size of 4M tokens might achieve ~400-500 TPS per GPU, or ~3.3-4.1M TPS cluster-wide, with FlashAttention and mixed-precision (FP8/BF16) optimizations enabled.
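
The worked example above is plain arithmetic; a short sketch reproducing its numbers:

```python
# Reproduce the worked example above (illustrative numbers from the text).
num_gpus = 8_192
tps_per_gpu_low, tps_per_gpu_high = 400, 500

cluster_tps_low = num_gpus * tps_per_gpu_low     # 3,276,800 (~3.3M) tokens/s
cluster_tps_high = num_gpus * tps_per_gpu_high   # 4,096,000 (~4.1M) tokens/s

# Equivalently, from the global batch size and a measured step time:
global_batch_tokens = 4_000_000
step_time_s = global_batch_tokens / cluster_tps_low            # ~1.22 s/step
tps_per_gpu = global_batch_tokens / (step_time_s * num_gpus)   # recovers 400
print(cluster_tps_low, cluster_tps_high, round(step_time_s, 2), tps_per_gpu)
```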

TPS matters because it directly determines training wall-clock time. A 1-trillion-token training set on a cluster sustaining 10 million TPS takes roughly 100,000 seconds (~28 hours) per epoch, ignoring checkpointing and other overhead. Higher TPS reduces costs and shortens iteration cycles, enabling faster experimentation and larger-scale training. It is a primary optimization target for infrastructure teams and correlates with model FLOPS utilization (MFU), though TPS is the more intuitive end-to-end metric.
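
A sketch of both calculations: the wall-clock arithmetic from this paragraph, and the common conversion from TPS to MFU using the standard estimate of ~6 x parameter-count training FLOPs per token for a dense transformer. The per-device peak-FLOPS figure is an assumed round number for illustration, not a specific device's spec:

```python
# Wall-clock time from the paragraph's numbers.
dataset_tokens = 1_000_000_000_000    # 1T-token training set
cluster_tps = 10_000_000              # 10M tokens/sec
seconds_per_epoch = dataset_tokens / cluster_tps   # 100,000 s
hours_per_epoch = seconds_per_epoch / 3600         # ~27.8 h

# TPS -> MFU, using the common ~6*N FLOPs/token estimate for a dense
# transformer (forward + backward combined).
def mfu(params, cluster_tps, num_devices, peak_flops_per_device=1e15):
    # peak_flops_per_device is an assumed round number, not a specific GPU
    achieved = 6 * params * cluster_tps / num_devices   # FLOP/s per device
    return achieved / peak_flops_per_device

print(hours_per_epoch)                                  # ~27.8
print(mfu(70e9, cluster_tps=4.1e6, num_devices=8192))   # ~0.21 (21% MFU)
```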

TPS is used throughout the training lifecycle (pre-training, fine-tuning, and continual training) but is most critical during large-scale pre-training, where token counts reach trillions. Alternatives like MFU or hardware utilization percentages provide complementary insights but do not directly reflect training throughput. Common pitfalls include conflating training TPS with inference TPS (which omits the backward pass), ignoring the impact of sequence length on attention complexity (quadratic for vanilla attention, mitigated by FlashAttention-2/3), and neglecting communication overhead in distributed setups (e.g., all-reduce latency can dominate at small batch sizes).
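
To make the sequence-length pitfall concrete, a rough per-token FLOPs model showing how the quadratic attention term overtakes the linear matmul term as context grows. The layer count and width are an illustrative dense-70B-style configuration, and the constants follow the usual transformer FLOPs accounting (exact coefficients vary by architecture):

```python
def per_token_flops(params, seq_len, num_layers, d_model):
    """Rough forward-pass FLOPs per token for a dense transformer.

    Linear term: ~2 FLOPs per parameter per token (weight matmuls).
    Attention term: ~2*d_model FLOPs per attended position per layer for
    the QK^T scores, plus the same again for the attention-weighted sum
    over V, so ~4*num_layers*d_model*seq_len, growing with context length.
    """
    linear = 2 * params
    attention = 4 * num_layers * d_model * seq_len
    return linear + attention

# The quadratic term's share rises with context length (illustrative config).
for s in (2_048, 8_192, 32_768, 131_072):
    total = per_token_flops(params=70e9, seq_len=s, num_layers=80, d_model=8192)
    attn = 4 * 80 * 8192 * s
    print(f"seq_len={s:>7}: attention share = {attn / total:.1%}")
# Prints roughly 3.7%, 13.3%, 38.0%, 71.0%: why long-context TPS drops.
```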

As of 2026, state-of-the-art TPS has advanced via specialized hardware like NVIDIA's Blackwell B200 (with FP8 tensor cores and NVLink 5), Google's TPU v6, and AMD's MI400 series. Software innovations include Ring Attention for sequence parallelism, asynchronous distributed training (e.g., DeepSpeed ZeRO-3 with overlap of communication and computation), and 4-bit training (e.g., QLoRA-style for fine-tuning). Frontier models like GPT-5 (hypothetical) and Llama 4 (MoE 8x120B) leverage expert parallelism and FP8 training to push cluster TPS beyond 50 million for trillion-parameter models on 100K+ accelerator clusters.

Examples

  • Llama 3.1 405B pre-training on 16K H100 GPUs achieved ~380 TPS per GPU with 8K sequence length and FSDP (a back-of-the-envelope check on this figure follows the list).
  • GPT-4 (rumored MoE 1.8T parameters) used pipeline parallelism to reach ~1.2M cluster TPS on 25K A100 GPUs.
  • Google's PaLM 2 (540B) on TPU v4 pods reported ~1.5M TPS per pod (4,096 chips) with 2K sequence length.
  • DeepSeek-V2 (236B MoE) used multi-head latent attention and achieved ~2,100 TPS per GPU on 8x H800 with 32K context.
  • Megatron-LM's 530B model on 4,480 A100 GPUs demonstrated 1.2M TPS with 2K sequence length and tensor/pipeline parallelism.
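
As a back-of-the-envelope check on the first figure above, a minimal sketch; the ~15.6T-token corpus size is an assumption taken from Meta's public reporting rather than from the list, and the result is an idealized lower bound that ignores restarts and checkpointing:

```python
# Back-of-the-envelope check on the Llama 3.1 405B line above.
tps_per_gpu = 380
num_gpus = 16_384            # "16K H100 GPUs"
cluster_tps = tps_per_gpu * num_gpus          # ~6.2M tokens/sec

corpus_tokens = 15.6e12      # assumed: Meta's publicly reported corpus size
seconds = corpus_tokens / cluster_tps         # ignores restarts/checkpoints
print(f"{cluster_tps / 1e6:.1f}M TPS -> ~{seconds / 86_400:.0f} days of compute")
# ~6.2M TPS -> ~29 days at ideal throughput.
```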

Related terms

Model FLOPS Utilization (MFU) · Distributed Training · FlashAttention · Mixed Precision Training · Throughput


FAQ

What is Tokens per Second?

Tokens per second (TPS) measures the number of tokens a model processes per second during training, counting both forward and backward passes, and directly reflects training throughput and hardware utilization.

How does Tokens per Second work?

TPS counts every token in each training batch processed per second across all accelerators in a distributed setup, covering both the forward and backward passes. It is reported per accelerator or aggregated cluster-wide and depends on model architecture, sequence length, batch size, hardware type, interconnect bandwidth, and the parallelism strategy used.

Where is Tokens per Second used in 2026?

TPS remains central across pre-training, fine-tuning, and continual training, and matters most in large-scale pre-training. As of 2026, hardware such as NVIDIA's Blackwell B200, Google's TPU v6, and AMD's MI400 series, together with techniques like Ring Attention, communication/computation overlap (e.g., DeepSpeed ZeRO-3), and FP8 training, pushes cluster-wide TPS beyond 50 million tokens per second on 100K+ accelerator clusters.