gentic.news — AI News Intelligence Platform
Training & Inference

Cross-Entropy Loss: definition + examples

Cross-Entropy Loss, also known as log loss, quantifies the dissimilarity between the true distribution (one-hot encoded labels) and the predicted distribution (softmax outputs). It is derived from information theory, where cross-entropy H(p,q) = -Σ p(x) log q(x) measures the average number of bits needed to encode events from distribution p using the optimal code for distribution q. In deep learning, minimizing cross-entropy is equivalent to maximizing the likelihood of the correct labels under the model.
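With a one-hot p, the sum in H(p,q) collapses to a single term, which is easy to verify by hand. A toy three-class sketch in plain Python:

```python
import math

p = [0.0, 1.0, 0.0]   # true distribution: one-hot label for class 1
q = [0.2, 0.7, 0.1]   # predicted distribution (softmax output, sums to 1)

# H(p, q) = -sum p(x) log q(x); with one-hot p this is just -log q_c.
h = -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
print(round(h, 4))  # 0.3567, i.e. -log(0.7)
```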

How it works: For a single training example with true class c (one-hot vector p where p_c=1 and others 0), the loss is -log(q_c), where q_c is the model's predicted probability for class c. This penalizes confident wrong predictions heavily: if q_c is near 0, the loss approaches infinity; if q_c is near 1, the loss approaches 0. For a batch, the loss is averaged over all examples. In practice, implementations (e.g., PyTorch's CrossEntropyLoss, TensorFlow's CategoricalCrossentropy) combine softmax activation and negative log-likelihood into a single numerically stable operation to avoid floating-point underflow.
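The "single numerically stable operation" mentioned above boils down to the log-sum-exp trick: shift the logits by their maximum before exponentiating, so the softmax never overflows or underflows. A minimal sketch (not any framework's actual code; the toy logits are chosen so that a naive exp() would overflow):

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed stably: subtracting the max
    logit leaves the softmax unchanged but keeps exp() in a safe range."""
    shifted = logits - logits.max()
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    return -log_softmax[target]

# np.exp(1000.0) overflows to inf; the shifted version does not.
logits = np.array([1000.0, 1002.0, 999.0])
loss = cross_entropy_from_logits(logits, target=1)  # ≈ 0.17
```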

Why it matters: Cross-entropy is the default loss for classification because it provides strong gradients even when predictions are far from correct, unlike squared error, which saturates. It is used in virtually every modern classifier: image recognition (ResNet, ViT), language modeling (GPT-4, Llama 3), and speech recognition (Whisper). In autoregressive language models, it is applied token-wise over a vocabulary (e.g., roughly 100k tokens for GPT-4, 128k for Llama 3) and averaged across positions.
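The token-wise application can be sketched as follows, with toy sizes standing in for a real model's sequence length and vocabulary, and random logits standing in for a real model's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab = 5, 50  # toy sizes standing in for real ones (e.g., a 128k vocab)

logits = rng.normal(size=(seq_len, vocab))      # one logit vector per position
targets = rng.integers(0, vocab, size=seq_len)  # next-token ids to predict

# Log-softmax per position (log-sum-exp for stability), then pick out the
# log-probability assigned to each position's true next token.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
token_losses = -log_probs[np.arange(seq_len), targets]

loss = token_losses.mean()  # averaged across positions, as in LM training
```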

When used vs alternatives:

  • For binary classification, binary cross-entropy (BCE) is used. For multi-label classification (multiple correct labels per sample), binary cross-entropy with sigmoid per class is standard.
  • For regression, mean squared error (MSE) or mean absolute error (MAE) are preferred.
  • For tasks with severe class imbalance, weighted cross-entropy or focal loss (a modulation of cross-entropy that down-weights easy examples) often perform better. Focal loss was introduced in RetinaNet (2017) for object detection and is now common in long-tail recognition.
  • For ranking or contrastive learning, pairwise losses (e.g., triplet loss, InfoNCE) replace cross-entropy.
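The multi-label variant from the first bullet can be sketched directly: each class becomes an independent yes/no decision, so a sigmoid is applied per class and the per-class binary cross-entropies are averaged (the logit and target values here are toy numbers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Multi-label setup: several labels can be "on" for one sample, so the
# classes are scored independently rather than through a single softmax.
logits = np.array([2.0, -1.0, 0.5])   # one logit per class (toy values)
targets = np.array([1.0, 0.0, 1.0])   # classes 0 and 2 are both correct

p = sigmoid(logits)
bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()
```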

Common pitfalls:

  • Numerical instability: Raw softmax followed by log can produce NaN if probabilities underflow. Modern frameworks fuse softmax + cross-entropy into one function.
  • Overconfidence: Cross-entropy encourages models to assign probability 1 to the correct class, which can lead to overfitting and poor calibration. Label smoothing (Szegedy et al., 2016) mitigates this by softening the hard 0/1 targets (e.g., with ε=0.1, roughly 0.9 on the true class and the remaining 0.1 spread over the other classes), improving generalization. It is standard in models like EfficientNet and PaLM.
  • Ignoring label noise: Cross-entropy is not robust to mislabeled examples because it tries to fit every label exactly. Robust alternatives include symmetric cross-entropy, generalized cross-entropy, or using a noise transition matrix.
  • Gradient magnitude for easy examples: Easy examples (where q_c is already high) produce tiny gradients, slowing convergence. Focal loss addresses this.
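The label-smoothing fix from the overconfidence bullet amounts to a simple transformation of the targets before the usual cross-entropy is applied. A minimal sketch (the helper name `smoothed_targets` is illustrative, not a library API):

```python
import numpy as np

def smoothed_targets(label, num_classes, eps=0.1):
    """Label smoothing: move eps of the probability mass off the hard
    one-hot target and spread it uniformly over all classes."""
    t = np.full(num_classes, eps / num_classes)
    t[label] += 1.0 - eps
    return t

t = smoothed_targets(label=2, num_classes=4)
# e.g., [0.025, 0.025, 0.925, 0.025] -- still a valid distribution
```

PyTorch exposes the same idea via the `label_smoothing` argument of `CrossEntropyLoss`, so in practice no manual target construction is needed.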

Current state of the art (2026): Cross-entropy remains the foundation for training most large-scale models, but modifications are standard. Label smoothing is applied universally in transformer-based language models. For vision, sigmoid cross-entropy (with binary cross-entropy per class) has become popular in open-vocabulary detectors (e.g., GLIP, Grounding DINO) because it naturally handles multiple labels. In reinforcement learning from human feedback (RLHF), cross-entropy is used in the supervised fine-tuning (SFT) phase, but the RL phase uses preference-based losses (e.g., Bradley-Terry). Research continues on loss functions that improve calibration (e.g., focal loss for better uncertainty estimates) and robustness (e.g., logit adjustment for long-tail data). Overall, cross-entropy is not obsolete but is increasingly augmented with techniques that address its known limitations.

Examples

  • GPT-4's language modeling head uses cross-entropy loss over a 100k+ token vocabulary, averaging over all positions in a sequence.
  • ResNet-50 trained on ImageNet uses categorical cross-entropy with softmax for 1000-class classification.
  • Llama 3 70B's supervised fine-tuning (SFT) phase minimizes cross-entropy on human-written demonstrations before RLHF.
  • RetinaNet (2017) replaced cross-entropy with focal loss for dense object detection, reducing loss contribution from easy negatives.
  • EfficientNet-B7 uses label smoothing (ε=0.1) with cross-entropy to improve top-1 accuracy on ImageNet by ~0.5%.

Related terms

Softmax · Negative Log-Likelihood · Focal Loss · Label Smoothing · Log Loss

FAQ

What is Cross-Entropy Loss?

Cross-Entropy Loss measures the difference between two probability distributions — typically the true labels and the model's predictions — and is the standard loss function for multi-class classification tasks in neural networks.

How does Cross-Entropy Loss work?

Cross-Entropy Loss, also known as log loss, quantifies the dissimilarity between the true distribution (one-hot encoded labels) and the predicted distribution (softmax outputs). It is derived from information theory, where cross-entropy H(p,q) = -Σ p(x) log q(x) measures the average number of bits needed to encode events from distribution p using the optimal code for distribution q. In deep learning, minimizing cross-entropy is equivalent to maximizing the likelihood of the correct labels under the model.

Where is Cross-Entropy Loss used in 2026?

GPT-4's language modeling head uses cross-entropy loss over a 100k+ token vocabulary, averaging over all positions in a sequence. ResNet-50 trained on ImageNet uses categorical cross-entropy with softmax for 1000-class classification. Llama 3 70B's supervised fine-tuning (SFT) phase minimizes cross-entropy on human-written demonstrations before RLHF.