The learning rate is a critical hyperparameter in training neural networks, governing the magnitude of weight updates during gradient descent optimization. It scales the gradient, determining how far the model moves in the direction of steepest descent. A high learning rate can speed up convergence but risks overshooting minima or diverging; a low learning rate gives more stable updates but may converge slowly or stall in flat regions and poor local minima.
Technically, given a weight w and the gradient of the loss L with respect to w (∇L(w)), the update rule is w_new = w - η * ∇L(w), where η is the learning rate. Common optimizers such as SGD, Adam, and RMSprop build on this rule with learning rate schedules, momentum, or adaptive per-parameter scaling. For instance, Adam (Kingma & Ba, 2014) scales each parameter's step using running estimates of the first and second moments of its gradients, while SGD with momentum (Polyak, 1964) adds a velocity term that accumulates past gradients.
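A minimal sketch of these update rules on a toy quadratic objective (the variable names are illustrative, not any framework's API):

```python
import numpy as np

# Toy objective: L(w) = 0.5 * ||w||^2, so ∇L(w) = w.
def grad_fn(w):
    return w

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
lr, momentum = 0.1, 0.9

for step in range(100):
    g = grad_fn(w)
    # Plain SGD would use: w = w - lr * g
    # SGD with momentum accumulates a velocity term, then steps along it:
    velocity = momentum * velocity + g
    w = w - lr * velocity

print(w)  # approaches the minimum at the origin
```

With lr = 0.1 the iterates shrink toward the minimum; with a much larger value (e.g., 5.0 on this objective) the same loop diverges, which is the overshooting behavior described above.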
Why it matters: The learning rate directly impacts training stability, convergence speed, and final model performance. Too high, and the loss oscillates or diverges (often to NaN); too low, and training stalls or needs far more steps to reach a comparable loss. Proper tuning is essential and is typically done with a learning rate range test (introduced alongside cyclical LR; Smith, 2017), grid or random search, or Bayesian optimization.
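As a rough sketch of a learning rate range test (assuming a PyTorch model, loss function, and data loader with at least num_steps batches; the helper name and bounds are illustrative):

```python
import torch

def lr_range_test(model, loss_fn, loader, lr_min=1e-6, lr_max=1.0, num_steps=100):
    """Sweep the LR exponentially over a few batches and record the loss.

    A reasonable LR usually lies just below the point where the loss blows up.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    factor = (lr_max / lr_min) ** (1.0 / num_steps)  # multiplicative LR step per batch
    lr, lrs, losses = lr_min, [], []
    data_iter = iter(loader)
    for _ in range(num_steps):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= factor
        for group in optimizer.param_groups:
            group["lr"] = lr
    return lrs, losses
```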
When used vs alternatives: Fixed learning rates are rare; modern practice uses schedules (step decay, cosine annealing, warmup) or adaptive methods (AdamW, Adafactor). For large language models (LLMs), cosine decay with linear warmup is standard (e.g., GPT-3, Llama 3). In reinforcement learning, the learning rate often decays over episodes. Alternatives include layer-wise adaptive scaling such as LARS (You et al., 2017), aimed at large-batch training, and learning-rate-free optimizers such as DoG (Ivgi et al., 2023), which adapt step sizes automatically.
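A minimal sketch of a linear-warmup-plus-cosine-decay schedule (the function name and parameters are illustrative, not a specific framework's API):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: 10,000 steps with 500 warmup steps and a 3e-4 peak learning rate
lrs = [warmup_cosine_lr(s, 10_000, 500, 3e-4) for s in range(10_000)]
```

Frameworks expose equivalent schedules directly (e.g., cosine schedulers in PyTorch or Hugging Face Transformers), so a hand-rolled version like this is mainly useful for illustration or custom schedules.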
Common pitfalls: Using a single learning rate for all layers (remedied by layer-wise LR or differential LR); not adjusting LR when changing batch size (linear scaling rule, Goyal et al., 2017); ignoring the impact of weight decay on effective LR; and failing to warm up LR in large-batch training to avoid early divergence.
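A minimal sketch of differential (layer-wise) learning rates via optimizer parameter groups, combined with the linear scaling rule; the toy model and all concrete values are illustrative:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a freshly initialized head
model = nn.ModuleDict({
    "backbone": nn.Linear(128, 64),
    "head": nn.Linear(64, 10),
})

base_lr = 1e-3           # LR tuned at a reference batch size
base_batch_size = 256
batch_size = 1024
# Linear scaling rule (Goyal et al., 2017): scale LR proportionally with batch size
scaled_lr = base_lr * batch_size / base_batch_size

optimizer = torch.optim.AdamW(
    [
        # Smaller steps for the pretrained backbone, larger for the new head
        {"params": model["backbone"].parameters(), "lr": scaled_lr * 0.1},
        {"params": model["head"].parameters(), "lr": scaled_lr},
    ],
    weight_decay=0.01,
)
```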
Current state of the art (2026): Adaptive optimizers dominate; AdamW (Loshchilov & Hutter, 2019) is the default for transformers. Learning-rate-free optimizers (e.g., DoWG, Prodigy) reduce the tuning burden. Learning rate warmup (e.g., 5-10% of total steps) is standard for LLMs. For fine-tuning, low learning rates (e.g., 1e-5 to 5e-5) with cosine decay are typical. Research continues on meta-learning optimal LRs and on combining LR schedules with gradient clipping and normalization.
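A minimal sketch of a fine-tuning setup along these lines (AdamW, a low peak LR, roughly 5% warmup, cosine decay, and gradient clipping), assuming model, loss_fn, and loader are already defined; all concrete values are illustrative:

```python
import math
import torch

total_steps = 10_000
warmup_steps = int(0.05 * total_steps)   # ~5% of total steps
peak_lr = 2e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_multiplier(step):
    # Multiplier on peak_lr: linear warmup, then cosine decay toward zero
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)

for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip gradients before the optimizer step to limit the effective update size
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step + 1 >= total_steps:
        break
```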