
Loss Function: definition + examples

A loss function (also called cost or objective function) is a mathematical function that maps the output of a machine learning model and the corresponding ground-truth labels to a scalar value representing the "cost" or error of that prediction. During training, the model's parameters are iteratively adjusted to minimize this scalar via gradient descent (or its variants like Adam, SGD with momentum). The choice of loss function directly shapes what the model learns and how it converges.
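
A minimal sketch of that training loop, assuming a linear model fit with MSE and plain gradient descent; the names (theta, lr) and shapes are illustrative, not any specific framework's API.

    import numpy as np

    # Toy setup: a linear model f(x; theta) = x @ theta, trained by minimizing MSE.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 3))                     # inputs
    y = X @ np.array([2.0, -1.0, 0.5])               # ground-truth targets
    theta = np.zeros(3)                              # model parameters
    lr = 0.1                                         # learning rate (eta)

    for step in range(200):
        y_hat = X @ theta                            # model predictions
        loss = np.mean((y_hat - y) ** 2)             # scalar loss: MSE
        grad = (2.0 / len(y)) * X.T @ (y_hat - y)    # gradient of the loss w.r.t. theta
        theta -= lr * grad                           # update: theta <- theta - eta * grad

    print("final MSE:", np.mean((X @ theta - y) ** 2))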

How it works technically:

For a model with parameters θ, input x, and true label y, the loss L(θ) = Σ_i ℓ(f(x_i; θ), y_i) over a batch of i.i.d. samples. The gradient ∇_θ L is computed via backpropagation, and parameters are updated: θ ← θ - η ∇_θ L. Common loss families include:

  • Regression: Mean Squared Error (MSE) = (1/n) Σ (ŷ_i - y_i)² — sensitive to outliers. Mean Absolute Error (MAE) = (1/n) Σ |ŷ_i - y_i| — robust but non-smooth at zero. Huber loss combines both.
  • Classification: Cross-Entropy Loss (log loss) = - Σ y_i log(ŷ_i) — standard for softmax outputs. Binary cross-entropy (BCE) for two-class. Hinge loss (SVM) for max-margin.
  • Probabilistic / generative: Negative log-likelihood (NLL) for density estimation. Kullback–Leibler (KL) divergence for variational autoencoders. Wasserstein loss for GANs (WGAN).
  • Ranking / retrieval: Pairwise hinge (RankNet), ListNet, NDCG-based surrogates.
  • Self-supervised: Contrastive loss (SimCLR, MoCo), InfoNCE (CPC), masked reconstruction (BERT's MLM loss).
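
Minimal NumPy sketches of a few of the losses listed above; array shapes, clipping constants, and function names are illustrative assumptions rather than any library's API.

    import numpy as np

    def mse(y_hat, y):
        # Mean Squared Error: quadratic penalty, sensitive to outliers
        return np.mean((y_hat - y) ** 2)

    def mae(y_hat, y):
        # Mean Absolute Error: linear penalty, robust but non-smooth at zero
        return np.mean(np.abs(y_hat - y))

    def huber(y_hat, y, delta=1.0):
        # Huber: quadratic near zero, linear in the tails
        r = np.abs(y_hat - y)
        return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

    def binary_cross_entropy(p, y, eps=1e-12):
        # BCE for two-class problems; p is the predicted probability of class 1
        p = np.clip(p, eps, 1.0 - eps)
        return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def cross_entropy(probs, labels, eps=1e-12):
        # Multi-class cross-entropy over softmax outputs; labels are integer class ids
        n = labels.shape[0]
        return np.mean(-np.log(probs[np.arange(n), labels] + eps))

    def hinge(scores, y_signed):
        # Max-margin hinge loss (SVM); y_signed takes values in {-1, +1}
        return np.mean(np.maximum(0.0, 1.0 - y_signed * scores))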

Why it matters:

The loss function defines the optimization landscape. A poorly chosen loss can cause vanishing gradients, slow convergence, or learning a degenerate solution (e.g., always predicting the mean for MSE on imbalanced regression). Conversely, a well-designed loss can encode inductive biases — e.g., Triplet loss for face recognition enforces embedding margins; Focal loss (Lin et al., 2017) down-weights easy examples for dense object detection, improving class imbalance handling.
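
To make the focal-loss point concrete, here is a minimal NumPy sketch of binary focal loss in the spirit of Lin et al. (2017); the default alpha and gamma follow common usage, and the function name and shapes are assumptions for illustration.

    import numpy as np

    def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-12):
        # probs:   predicted probabilities of the positive class, shape (n,)
        # targets: 0/1 ground-truth labels, shape (n,)
        # The (1 - p_t)^gamma factor down-weights easy, well-classified examples,
        # concentrating the gradient signal on hard examples and rare classes.
        p_t = np.where(targets == 1, probs, 1.0 - probs)        # prob of the true class
        alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)    # class balancing weight
        return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps))

    # With gamma=0 and alpha=0.5 this reduces (up to a constant) to plain BCE.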

When used vs alternatives:

  • In supervised learning, loss is always present. In reinforcement learning, the policy gradient objective (e.g., PPO clipped surrogate) is a type of loss. In unsupervised learning, reconstruction loss (VAE, autoencoder) or contrastive loss (SimCLR) replaces label-based loss.
  • Alternatives to a single scalar loss include multi-task loss (weighted sum of task-specific losses), adversarial loss (GAN discriminator), or meta-learned loss functions (e.g., learned optimizer).
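
A minimal sketch of the multi-task "weighted sum of task-specific losses" mentioned above; the task names and weights are made-up placeholders.

    def multi_task_loss(losses, weights):
        # losses:  dict mapping task name -> scalar loss value
        # weights: dict mapping task name -> fixed weight (a tuning hyperparameter)
        return sum(weights[name] * value for name, value in losses.items())

    # Example: a detection-style composite objective with hand-picked weights.
    total = multi_task_loss(
        losses={"box_regression": 0.8, "objectness": 0.3, "classification": 1.2},
        weights={"box_regression": 7.5, "objectness": 1.0, "classification": 0.5},
    )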

Common pitfalls:

1. Using MSE for classification — paired with sigmoid or softmax outputs, MSE yields vanishing gradients on confidently wrong predictions and a non-convex objective, whereas cross-entropy keeps penalizing confident mistakes strongly, so gradients stay informative and training converges more reliably.

2. Ignoring class imbalance — vanilla cross-entropy on long-tail data yields a model biased toward majority classes. Solutions: weighted cross-entropy, focal loss, or class-balanced loss (Cui et al., 2019).

3. Naïve loss combination — simple weighted sum of multiple losses (e.g., L1 + perceptual + GAN) can lead to one loss dominating. Use uncertainty weighting (Kendall et al., 2018) or GradNorm.

4. Overfitting to loss — zero training loss may indicate memorization; use regularization (L2, label smoothing) or early stopping.
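
As a concrete example for pitfall 4, a minimal NumPy sketch of label smoothing applied to cross-entropy; ε=0.1 matches the ResNet-50 example later in this article, and the function name and shapes are illustrative.

    import numpy as np

    def smoothed_cross_entropy(probs, labels, eps=0.1):
        # probs:  predicted class probabilities, shape (n, K)
        # labels: integer class ids, shape (n,)
        # The one-hot target is mixed with a uniform distribution:
        # (1 - eps) * one_hot + eps / K, which discourages overconfident predictions.
        n, K = probs.shape
        targets = np.full((n, K), eps / K)
        targets[np.arange(n), labels] = 1.0 - eps + eps / K
        return np.mean(-np.sum(targets * np.log(probs + 1e-12), axis=1))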

Current state of the art (2026):

  • Large language models (LLMs): Next-token prediction with cross-entropy remains dominant, but recent work uses token-level loss weighting (e.g., importance sampling for rare tokens) and reinforcement learning from human feedback (RLHF) with a preference-based loss (Bradley-Terry model).

  • Diffusion models: Simple L2 loss on noise prediction (Ho et al., 2020) is still common; improved variants include v-prediction, min-SNR loss (Hang et al., 2023), and flow-matching losses.
  • Vision transformers (ViTs): Contrastive loss (CLIP) and DINO self-distillation loss (Caron et al., 2021) for self-supervised pretraining.
  • Multimodal: Alignment losses like InfoNCE across modalities (CLIP, Flava, ImageBind).
  • Robustness: Adversarial training uses a min-max loss (Madry et al., 2018). Distributionally robust optimization (DRO) loss for group fairness.
  • Meta-learning: Learned loss functions (e.g., by gradient-based hyperparameter optimization) are emerging for few-shot adaptation.
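
A minimal sketch of the Bradley–Terry preference loss used for RLHF reward modeling, as referenced in the LLM bullet above; the reward values here are placeholder scalars rather than outputs of any particular model.

    import numpy as np

    def preference_loss(reward_chosen, reward_rejected):
        # Bradley-Terry pairwise preference loss:
        #   L = -log sigmoid(r_chosen - r_rejected)
        # so the reward model learns to score preferred responses higher.
        margin = reward_chosen - reward_rejected
        return np.mean(np.logaddexp(0.0, -margin))   # stable form of -log(sigmoid(margin))

    # Placeholder reward scores for three preference pairs.
    loss = preference_loss(np.array([1.2, 0.4, 2.0]), np.array([0.3, 0.9, 1.5]))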

In production, loss monitoring is a key signal for model health — unexpected spikes often indicate data drift, broken preprocessing, or training instability.
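
A minimal sketch of one way such a spike check might look; the window size and z-score threshold are arbitrary choices for illustration, not a standard recipe.

    import numpy as np

    def loss_spike_alert(loss_history, window=100, z_threshold=4.0):
        # Compares the latest loss value against the mean/std of the previous
        # `window` steps; window and threshold would be tuned per training job.
        if len(loss_history) <= window:
            return False
        recent = np.asarray(loss_history[-window - 1:-1])
        mu, sigma = recent.mean(), recent.std() + 1e-12
        return (loss_history[-1] - mu) / sigma > z_threshold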

Examples

  • GPT-4 uses next-token prediction with cross-entropy loss over a vocabulary of ~100K tokens.
  • YOLOv8 uses a composite loss: box regression (CIoU), objectness (BCE), and classification (BCE) weighted by hyperparameters.
  • Stable Diffusion 3 uses a rectified flow-matching loss to train its multimodal diffusion transformer (MMDiT) denoiser.
  • ResNet-50 for ImageNet classification minimizes categorical cross-entropy with label smoothing (ε=0.1).
  • DeepMind's AlphaFold 2 combines an auxiliary distogram loss (cross-entropy over binned inter-residue distances) with a Frame-Aligned Point Error (FAPE) loss for structure accuracy.

FAQ

What is Loss Function?

A loss function quantifies the error between a model's predictions and the true targets during training, guiding gradient-based optimization. Lower loss indicates better fit.

How does Loss Function work?

A loss function (also called cost or objective function) is a mathematical function that maps the output of a machine learning model and the corresponding ground-truth labels to a scalar value representing the "cost" or error of that prediction. During training, the model's parameters are iteratively adjusted to minimize this scalar via gradient descent (or its variants like Adam, SGD…

Where is Loss Function used in 2026?

GPT-4 uses next-token prediction with cross-entropy loss over a vocabulary of ~100K tokens. YOLOv8 uses a composite loss: box regression (CIoU), objectness (BCE), and classification (BCE) weighted by hyperparameters. Stable Diffusion 3 uses a rectified flow-matching loss to train its multimodal diffusion transformer (MMDiT) denoiser.