Training & Inference

Scaling Laws: definition + examples

Scaling laws are empirical principles that quantify how the performance of a large neural language model improves as a function of three primary resources: the number of parameters in the model (N), the size of the training dataset in tokens (D), and the total compute budget used for training (C). The foundational work by Kaplan et al. (2020, “Scaling Laws for Neural Language Models”) showed that test loss decreases as a power law in each of these factors, provided the other two are not bottlenecks. For compute-optimal allocation, they found that most of the budget should go to increasing model size, with a smaller fraction to data.

Hoffmann et al. (2022, “Training Compute-Optimal Large Language Models”) revised this view with the Chinchilla scaling law, demonstrating that many existing models (including GPT-3) were undertrained: for a given compute budget, the optimal ratio is roughly 20 training tokens per parameter. This led to the training of Chinchilla (70B parameters, 1.4T tokens), which outperformed much larger models such as Gopher (280B) on the same compute budget.

In practice, scaling laws are used to predict the performance of a model before it is trained, enabling engineers to choose the right model size and dataset size for a fixed compute budget. They also inform decisions on when to stop scaling and instead invest in data quality, architecture improvements, or fine-tuning.

Scaling laws have limitations, however. They are derived from smooth power-law fits that may break down at extreme scales (for example, where data becomes scarce or model capacity saturates). They also do not account for emergent abilities, that is, capabilities that appear suddenly at certain scales, such as in-context learning or reasoning, which are not smoothly predictable from loss alone.

As of 2026, the state of the art includes “compute-optimal” training recipes that jointly scale model parameters, training tokens, and batch size, as well as “data scaling laws” that account for data quality and deduplication. Frontier labs such as OpenAI, Google DeepMind, and Anthropic use scaling laws to design their largest models (e.g., GPT-4, Gemini, Claude 3). A notable recent development is the “bitwise scaling law” from DeepMind (2025), which models performance in terms of the total number of bits used for training, combining parameters and tokens into a single compute measure. Another active area is “scaling beyond loss”: using scaling laws to predict downstream task accuracy, not just perplexity.

Pitfalls to avoid include assuming scaling laws are strictly linear on a log-log plot (they are not at very small or very large scales), ignoring the diminishing returns of data beyond a certain token count, and applying laws derived from one architecture (e.g., dense transformers) directly to mixtures of experts (MoE) or recurrent models. In summary, scaling laws are a critical tool for efficient model design, but they must be used with an understanding of their empirical nature and the specific context of the model family and data distribution.
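In symbols, and keeping to functional forms only (the constants and exponents are empirical fits that vary with architecture and data, so treat the notation below as a sketch rather than exact values), the single-factor laws of Kaplan et al. and the joint parametric loss of Hoffmann et al. can be written as:

```latex
% Single-factor power laws (each holds when the other resources are not bottlenecks);
% N_c, D_c, C_c and the exponents \alpha_N, \alpha_D, \alpha_C are fitted empirically.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

% Chinchilla-style joint form: E is the irreducible loss; A, B, \alpha, \beta are fitted.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing the joint form under a fixed compute budget (using the common approximation C ≈ 6·N·D for training FLOPs) is what produces the roughly constant tokens-per-parameter ratio quoted above.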

Examples

  • Kaplan et al. (2020) showed that test loss follows a power-law with model size, data size, and compute for GPT-like transformers.
  • Hoffmann et al. (2022) introduced the Chinchilla scaling law, recommending ~20 tokens per parameter for compute-optimal training (a short allocation sketch follows this list).
  • OpenAI reportedly used scaling laws to allocate compute for GPT-4; widely circulated but unconfirmed estimates put it at roughly 1.8T parameters trained on about 13T tokens.
  • DeepMind's 2025 bitwise scaling law predicts loss as a function of total training bits (parameters × tokens), unifying previous approaches.
  • Anthropic's Claude 3 models (Opus, Sonnet, Haiku) were designed using scaling law projections to achieve specific capability tiers under fixed inference budgets.
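As a concrete illustration of the Chinchilla-style allocation referenced in the second bullet, here is a minimal sketch. It assumes the common approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter rule of thumb, and the parametric constants are approximate published fits, so treat the outputs as ballpark figures rather than exact predictions:

```python
import math

# Approximate Chinchilla-style parametric loss, L(N, D) = E + A / N**alpha + B / D**beta.
# The constants below are close to the values reported by Hoffmann et al. (2022),
# but treat them as illustrative rather than exact.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget between parameters and tokens.

    Uses two common approximations: training FLOPs C ~ 6 * N * D, and the
    Chinchilla rule of thumb D ~ 20 * N quoted in the text.
    """
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    budget = 5.88e23  # ~6 * 70e9 * 1.4e12, roughly a Chinchilla-scale training budget
    n, d = compute_optimal_allocation(budget)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}, predicted loss ~ {predicted_loss(n, d):.3f}")
```

Plugging in roughly a Chinchilla-scale budget recovers about 70B parameters and 1.4T tokens, matching the figures quoted above.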

Related terms

Compute-Optimal Training · Emergent Abilities · Power-Law Scaling · Data Scaling · Mixture of Experts (MoE)

FAQ

What are Scaling Laws?

Scaling laws describe predictable relationships between model performance and key training factors: dataset size, parameter count, and compute budget. They guide resource allocation and suggest that larger models and more data, up to a point, yield diminishing but reliable returns.
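To give a rough sense of “diminishing but reliable returns”: with a Kaplan-style model-size exponent of about 0.076 (the value reported for GPT-like transformers, used here purely for illustration), doubling the parameter count shrinks the size-limited loss by only about 5 percent:

```latex
\frac{L(2N)}{L(N)} = \left(\frac{N_c}{2N}\right)^{\alpha_N} \Big/ \left(\frac{N_c}{N}\right)^{\alpha_N}
= 2^{-\alpha_N} \approx 2^{-0.076} \approx 0.95
```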

How do Scaling Laws work?

Scaling laws relate a model's test loss to its parameter count (N), training tokens (D), and compute budget (C) through empirical power-law fits. Curves fitted on a series of smaller training runs are extrapolated to predict the loss of a larger model before it is trained, which in turn guides how a fixed compute budget is split between model size and data. The Chinchilla result, for example, recommends roughly 20 training tokens per parameter, the ratio behind its 70B-parameter, 1.4T-token training run.
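One way to see the mechanics end to end is to fit a power law on small runs and extrapolate to a larger model. The numbers below are hypothetical and the snippet is an illustrative sketch, not a reproduction of any published fit:

```python
import numpy as np

# Hypothetical (model size, final loss) pairs from a handful of small training runs.
# These numbers are illustrative only; real fits use many runs and careful tuning.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.20, 3.85, 3.50, 3.21, 2.95])

# A power law L(N) = c * N**(-alpha) is a straight line in log-log space:
# log L = log c - alpha * log N, so an ordinary least-squares fit recovers alpha and c.
slope, intercept = np.polyfit(np.log(params), np.log(losses), deg=1)
alpha, c = -slope, np.exp(intercept)

# Extrapolate to a model ~10x larger than the biggest run actually trained.
target = 1e10
predicted = c * target ** (-alpha)
print(f"fitted alpha = {alpha:.3f}, predicted loss at N={target:.0e}: {predicted:.2f}")
```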

Where are Scaling Laws used in 2026?

In 2026, scaling laws remain central to how frontier labs such as OpenAI, Google DeepMind, and Anthropic size their largest models (e.g., GPT-4, Gemini, Claude 3). Current practice includes compute-optimal recipes that jointly scale parameters, training tokens, and batch size; data scaling laws that account for quality and deduplication; DeepMind's 2025 bitwise scaling law; and work on predicting downstream task accuracy rather than just perplexity.