Gradient descent is the foundational optimization algorithm in machine learning, used to minimize a loss (or cost) function by iteratively updating model parameters in the direction opposite to the gradient of the loss with respect to those parameters. The gradient is a vector of partial derivatives that points in the direction of the steepest increase of the function; moving opposite to it reduces the loss.
How it works: Given a differentiable loss function L(θ) parameterized by θ, gradient descent updates θ at each step t as θ_{t+1} = θ_t - η ∇L(θ_t), where η is the learning rate (step size) and ∇L(θ_t) is the gradient. For a dataset of N examples, standard (batch) gradient descent computes the gradient over the entire dataset, which is deterministic but computationally expensive for large N. Stochastic gradient descent (SGD) approximates the gradient using a single randomly chosen example per update, introducing noise but making each update far cheaper and often improving generalization. Mini-batch gradient descent (the most common variant) uses a small random subset (e.g., 32, 64, 128, or 256 examples) per step, balancing efficiency and stability.
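A minimal sketch of the mini-batch update in Python with NumPy, using least-squares linear regression as the loss; the function name and hyperparameter defaults are illustrative, not taken from any particular library:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for least-squares linear regression.

    Implements theta <- theta - lr * grad L(theta), where
    L(theta) = (1/2m) * ||X_b @ theta - y_b||^2 on each mini-batch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)            # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= lr * grad               # step opposite the gradient
    return theta

# Toy usage: recover known weights from noisy synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=1000)
print(minibatch_gd(X, y))  # approximately [2.0, -1.0, 0.5]
```

Setting batch_size=1 recovers SGD, and batch_size=len(X) recovers full-batch gradient descent.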
Why it matters: Gradient descent enables training of models with millions or billions of parameters, from linear regression to large language models (LLMs) like GPT-4 and Llama 3.1 405B. Paired with backpropagation, which computes the gradients efficiently, it is the workhorse of neural network training. Without it, training deep models would be computationally infeasible.
When used vs alternatives: Gradient descent is the default for differentiable objectives. Alternatives include second-order methods (e.g., Newton's method, L-BFGS) that use curvature information for faster convergence but scale poorly to high-dimensional problems. For non-differentiable objectives, derivative-free methods (e.g., genetic algorithms, Bayesian optimization) or subgradient methods are used. In 2026, gradient descent remains dominant; adaptive first-order variants (Adam, AdamW, Adafactor) are preferred for Transformers, and Shampoo (an approximate second-order method) has gained traction for very large models due to improved efficiency via Kronecker-factored preconditioners.
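To make the curvature point concrete, here is a hedged comparison in Python on a small ill-conditioned quadratic, using SciPy's L-BFGS implementation; the matrix, step size, and iteration counts are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Ill-conditioned quadratic: L(theta) = 0.5 * theta^T A theta
A = np.diag([1.0, 100.0])          # condition number 100

def loss(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta0 = np.array([1.0, 1.0])

# Plain gradient descent: the step size is capped by the largest curvature
# (it must stay below 2/100 here), so the flat direction converges slowly.
theta = theta0.copy()
for _ in range(500):
    theta -= 0.009 * grad(theta)
print("GD after 500 steps:", theta)       # still visibly off in the flat direction

# L-BFGS builds a curvature estimate and needs only a handful of iterations
res = minimize(loss, theta0, jac=grad, method="L-BFGS-B")
print("L-BFGS:", res.x, "in", res.nit, "iterations")
```

The trade-off in the paragraph above shows up directly: the curvature estimate buys fast convergence here, but maintaining it becomes expensive as the parameter dimension grows.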
Common pitfalls: (1) A learning rate that is too high causes divergence; one that is too low leads to slow convergence. (2) Vanishing/exploding gradients in deep networks (mitigated by normalization layers, residual connections, and gradient clipping; a clipping sketch follows this list). (3) Getting stuck in saddle points or local minima (in high dimensions most bad critical points are saddle points rather than poor local minima, and the noise from SGD helps escape them). (4) Overfitting due to memorization (addressed by regularization, data augmentation, early stopping). (5) Sensitivity to feature scaling (mitigated by standardizing inputs; normalization layers such as BatchNorm and LayerNorm address the analogous problem for intermediate activations).
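A short sketch of gradient clipping inside a training loop, assuming PyTorch is available; the model, data, and max-norm value are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and random data, just to show the clipping pattern
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                  # backprop fills .grad on each parameter
    # Pitfall (2): rescale the global gradient norm to at most 1.0
    # so a single bad batch cannot blow up the parameters
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()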
Current state of the art (2026): The dominant variant is AdamW (Adam with decoupled weight decay), used in nearly all reported LLM training runs (e.g., Llama 3, GPT-4, Gemini). For memory-constrained settings (e.g., fine-tuning 70B+ models), mixed-precision training (FP16, BF16, FP8) combined with gradient checkpointing and distributed training strategies (DDP, FSDP, ZeRO) is standard. Second-order methods like Shampoo and M-FAC are increasingly used for pre-training due to better conditioning and fewer hyperparameter sweeps. Neural tangent kernel (NTK) theory and infinite-width limits have deepened theoretical understanding, but practice still relies on heuristics like cosine learning rate schedules, warm-up, and gradient clipping (max norm 1.0).
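A sketch of this standard recipe (AdamW with linear warm-up, cosine decay, and clipping), assuming PyTorch; the model, step counts, and hyperparameters are illustrative stand-ins, not a published training configuration:

```python
import math
import torch

model = torch.nn.Linear(512, 512)        # stand-in for a Transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warm-up followed by cosine decay to zero, a common LLM recipe
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                     # advance the learning-rate schedule
```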