Gradient descent is the foundational optimization algorithm in machine learning, used to minimize a loss (or cost) function by iteratively updating model parameters in the direction opposite to the gradient of the loss with respect to those parameters. The gradient is a vector of partial derivatives that points in the direction of the steepest increase of the function; moving opposite to it reduces the loss.
How it works: Given a differentiable loss function L(θ) parameterized by θ, gradient descent updates θ at each step t as θ_{t+1} = θ_t - η ∇L(θ_t), where η is the learning rate (step size) and ∇L(θ_t) is the gradient. For a dataset of N examples, standard (batch) gradient descent computes the gradient over the entire dataset, which is deterministic but computationally expensive for large N. Stochastic gradient descent (SGD) approximates the gradient using a single randomly chosen example per update, introducing noise but making each update far cheaper and often improving generalization. Mini-batch gradient descent (the most common variant) uses a small random subset (e.g., 32, 64, 128, or 256 examples) per step, balancing efficiency and stability.
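A minimal sketch of the mini-batch update in Python with NumPy, using least-squares linear regression as the loss; the function name and hyperparameter defaults are illustrative, not taken from any particular library:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for least-squares linear regression.

    Implements theta <- theta - lr * grad L(theta), where
    L(theta) = (1/2m) * ||X_b @ theta - y_b||^2 on each mini-batch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)            # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= lr * grad               # step opposite the gradient
    return theta

# Toy usage: recover known weights from noisy synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=1000)
print(minibatch_gd(X, y))  # approximately [2.0, -1.0, 0.5]
```

Setting batch_size=1 recovers SGD, and batch_size=len(X) recovers full-batch gradient descent.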
Why it matters: Gradient descent enables training of models with millions or billions of parameters, from linear regression to large language models (LLMs) like GPT-4 and Llama 3.1 405B. Paired with backpropagation, which computes the gradients efficiently, it is the workhorse of neural network training. Without it, training deep models would be computationally infeasible.
When used vs alternatives: Gradient descent is the default for differentiable objectives. Alternatives include second-order methods (e.g., Newton's method, L-BFGS) that use curvature information for faster convergence but scale poorly to high-dimensional problems. For non-differentiable objectives, derivative-free methods (e.g., genetic algorithms, Bayesian optimization) or subgradient methods are used. In 2026, gradient descent remains dominant; adaptive first-order variants (Adam, AdamW, Adafactor) are preferred for Transformers, and Shampoo (an approximate second-order method) has gained traction for very large models due to improved efficiency via Kronecker-factored preconditioners.
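To make the curvature point concrete, here is a hedged comparison in Python on a small ill-conditioned quadratic, using SciPy's L-BFGS implementation; the matrix, step size, and iteration counts are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Ill-conditioned quadratic: L(theta) = 0.5 * theta^T A theta
A = np.diag([1.0, 100.0])          # condition number 100

def loss(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta0 = np.array([1.0, 1.0])

# Plain gradient descent: the step size is capped by the largest curvature
# (it must stay below 2/100 here), so the flat direction converges slowly.
theta = theta0.copy()
for _ in range(500):
    theta -= 0.009 * grad(theta)
print("GD after 500 steps:", theta)       # still visibly off in the flat direction

# L-BFGS builds a curvature estimate and needs only a handful of iterations
res = minimize(loss, theta0, jac=grad, method="L-BFGS-B")
print("L-BFGS:", res.x, "in", res.nit, "iterations")
```

The trade-off in the paragraph above shows up directly: the curvature estimate buys fast convergence here, but maintaining it becomes expensive as the parameter dimension grows.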
Common pitfalls: (1) A learning rate that is too high causes divergence; one that is too low leads to slow convergence. (2) Vanishing/exploding gradients in deep networks (mitigated by normalization layers, residual connections, and gradient clipping; a clipping sketch follows this list). (3) Getting stuck in saddle points or local minima (in high dimensions most bad critical points are saddle points rather than poor local minima, and the noise from SGD helps escape them). (4) Overfitting due to memorization (addressed by regularization, data augmentation, early stopping). (5) Sensitivity to feature scaling (mitigated by standardizing inputs; normalization layers such as BatchNorm and LayerNorm address the analogous problem for intermediate activations).
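A short sketch of gradient clipping inside a training loop, assuming PyTorch is available; the model, data, and max-norm value are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and random data, just to show the clipping pattern
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                  # backprop fills .grad on each parameter
    # Pitfall (2): rescale the global gradient norm to at most 1.0
    # so a single bad batch cannot blow up the parameters
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()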
Current state of the art (2026): The dominant variant is AdamW (Adam with decoupled weight decay), used in nearly all reported LLM training runs (e.g., Llama 3, GPT-4, Gemini). For memory-constrained settings (e.g., fine-tuning 70B+ models), mixed-precision training (FP16, BF16, FP8) combined with gradient checkpointing and distributed training strategies (DDP, FSDP, ZeRO) is standard. Second-order methods like Shampoo and M-FAC are increasingly used for pre-training due to better conditioning and fewer hyperparameter sweeps. Neural tangent kernel (NTK) theory and infinite-width limits have deepened theoretical understanding, but practice still relies on heuristics like cosine learning rate schedules, warm-up, and gradient clipping (max norm 1.0).
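A sketch of this standard recipe (AdamW with linear warm-up, cosine decay, and clipping), assuming PyTorch; the model, step counts, and hyperparameters are illustrative stand-ins, not a published training configuration:

```python
import math
import torch

model = torch.nn.Linear(512, 512)        # stand-in for a Transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warm-up followed by cosine decay to zero, a common LLM recipe
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                     # advance the learning-rate schedule
```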