The learning rate is a critical hyperparameter in training neural networks, governing the magnitude of weight updates during gradient descent optimization. It scales the gradient, determining how far the model moves in the direction of steepest descent. A high learning rate can speed up convergence but risks overshooting minima or diverging; a low learning rate gives more stable updates but may converge slowly or stall in flat regions and poor local minima.
Technically, given a weight w and the gradient of the loss L with respect to w (∇L(w)), the update rule is w_new = w - η * ∇L(w), where η is the learning rate. Common optimizers such as SGD, Adam, and RMSprop build on this rule with learning rate schedules, momentum, or adaptive per-parameter scaling. For instance, Adam (Kingma & Ba, 2014) scales each parameter's step using running estimates of the first and second moments of its gradients, while SGD with momentum (Polyak, 1964) adds a velocity term that accumulates past gradients.
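A minimal sketch of these update rules on a toy quadratic objective (the variable names are illustrative, not any framework's API):

```python
import numpy as np

# Toy objective: L(w) = 0.5 * ||w||^2, so ∇L(w) = w.
def grad_fn(w):
    return w

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
lr, momentum = 0.1, 0.9

for step in range(100):
    g = grad_fn(w)
    # Plain SGD would use: w = w - lr * g
    # SGD with momentum accumulates a velocity term, then steps along it:
    velocity = momentum * velocity + g
    w = w - lr * velocity

print(w)  # approaches the minimum at the origin
```

With lr = 0.1 the iterates shrink toward the minimum; with a much larger value (e.g., 5.0 on this objective) the same loop diverges, which is the overshooting behavior described above.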
Why it matters: The learning rate directly impacts training stability, convergence speed, and final model performance. Too high, and the loss oscillates or diverges (often to NaN); too low, and training stalls or needs far more steps to reach a comparable loss. Proper tuning is essential and is typically done with a learning rate range test (introduced alongside cyclical LR; Smith, 2017), grid or random search, or Bayesian optimization.
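As a rough sketch of a learning rate range test (assuming a PyTorch model, loss function, and data loader with at least num_steps batches; the helper name and bounds are illustrative):

```python
import torch

def lr_range_test(model, loss_fn, loader, lr_min=1e-6, lr_max=1.0, num_steps=100):
    """Sweep the LR exponentially over a few batches and record the loss.

    A reasonable LR usually lies just below the point where the loss blows up.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    factor = (lr_max / lr_min) ** (1.0 / num_steps)  # multiplicative LR step per batch
    lr, lrs, losses = lr_min, [], []
    data_iter = iter(loader)
    for _ in range(num_steps):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= factor
        for group in optimizer.param_groups:
            group["lr"] = lr
    return lrs, losses
```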
When used vs alternatives: Fixed learning rates are rare; modern practice uses schedules (step decay, cosine annealing, warmup) or adaptive methods (AdamW, Adafactor). For large language models (LLMs), cosine decay with linear warmup is standard (e.g., GPT-3, Llama 3). In reinforcement learning, the learning rate often decays over episodes. Alternatives include layer-wise adaptive scaling such as LARS (You et al., 2017), aimed at large-batch training, and learning-rate-free optimizers such as DoG (Ivgi et al., 2023), which adapt step sizes automatically.
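A minimal sketch of a linear-warmup-plus-cosine-decay schedule (the function name and parameters are illustrative, not a specific framework's API):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: 10,000 steps with 500 warmup steps and a 3e-4 peak learning rate
lrs = [warmup_cosine_lr(s, 10_000, 500, 3e-4) for s in range(10_000)]
```

Frameworks expose equivalent schedules directly (e.g., cosine schedulers in PyTorch or Hugging Face Transformers), so a hand-rolled version like this is mainly useful for illustration or custom schedules.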
Common pitfalls: Using a single learning rate for all layers (remedied by layer-wise LR or differential LR); not adjusting LR when changing batch size (linear scaling rule, Goyal et al., 2017); ignoring the impact of weight decay on effective LR; and failing to warm up LR in large-batch training to avoid early divergence.
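A minimal sketch of differential (layer-wise) learning rates via optimizer parameter groups, combined with the linear scaling rule; the toy model and all concrete values are illustrative:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a freshly initialized head
model = nn.ModuleDict({
    "backbone": nn.Linear(128, 64),
    "head": nn.Linear(64, 10),
})

base_lr = 1e-3           # LR tuned at a reference batch size
base_batch_size = 256
batch_size = 1024
# Linear scaling rule (Goyal et al., 2017): scale LR proportionally with batch size
scaled_lr = base_lr * batch_size / base_batch_size

optimizer = torch.optim.AdamW(
    [
        # Smaller steps for the pretrained backbone, larger for the new head
        {"params": model["backbone"].parameters(), "lr": scaled_lr * 0.1},
        {"params": model["head"].parameters(), "lr": scaled_lr},
    ],
    weight_decay=0.01,
)
```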
Current state of the art (2026): Adaptive optimizers dominate; AdamW (Loshchilov & Hutter, 2019) is the default for transformers. Learning-rate-free optimizers (e.g., DoWG, Prodigy) reduce the tuning burden. Learning rate warmup (e.g., 5-10% of total steps) is standard for LLMs. For fine-tuning, low learning rates (e.g., 1e-5 to 5e-5) with cosine decay are typical. Research continues on meta-learning optimal LRs and on combining LR schedules with gradient clipping and normalization.
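A minimal sketch of a fine-tuning setup along these lines (AdamW, a low peak LR, roughly 5% warmup, cosine decay, and gradient clipping), assuming model, loss_fn, and loader are already defined; all concrete values are illustrative:

```python
import math
import torch

total_steps = 10_000
warmup_steps = int(0.05 * total_steps)   # ~5% of total steps
peak_lr = 2e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_multiplier(step):
    # Multiplier on peak_lr: linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)

for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip gradients before the optimizer step to limit the effective update size
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step + 1 >= total_steps:
        break
```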