Cross-Entropy Loss, also known as log loss, quantifies the dissimilarity between the true distribution (one-hot encoded labels) and the predicted distribution (softmax outputs). It is derived from information theory, where cross-entropy H(p,q) = -Σ p(x) log q(x) measures the average code length (bits for log base 2, nats for the natural log used in deep learning) needed to encode events from distribution p using the optimal code for distribution q. In deep learning, minimizing cross-entropy is equivalent to maximizing the likelihood of the correct labels under the model.
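To make the formula concrete, here is a minimal worked example in Python (the distributions are invented for illustration). Because p is one-hot, the sum collapses to -log(q_c), which is exactly the negative log-likelihood of the correct label:

```python
import math

# Minimal worked example of H(p, q) = -sum over x of p(x) * log(q(x))
# for a 3-class problem; p and q values are invented for illustration.
p = [0.0, 1.0, 0.0]   # one-hot true distribution: true class is index 1
q = [0.1, 0.7, 0.2]   # model's predicted (softmax) distribution

# Because p is one-hot, only the true-class term survives the sum.
cross_entropy = -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)
print(cross_entropy)  # -log(0.7) ≈ 0.357 nats
```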
How it works: For a single training example with true class c (one-hot vector p where p_c=1 and all other entries 0), the loss is -log(q_c), where q_c is the model's predicted probability for class c. This penalizes confident wrong predictions heavily: if q_c is near 0, the loss approaches infinity; if q_c is near 1, the loss approaches 0. For a batch, the loss is averaged over all examples. In practice, implementations (e.g., PyTorch's CrossEntropyLoss, TensorFlow's CategoricalCrossentropy with from_logits=True) combine softmax activation and negative log-likelihood into a single numerically stable operation to avoid floating-point underflow.
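As a sketch of the fused path versus the manual equivalent in PyTorch (logits and targets invented):

```python
import torch
import torch.nn.functional as F

# Sketch of the fused operation versus the manual equivalent (values invented).
logits = torch.tensor([[2.0, 0.5, -1.0],    # batch of 2 examples, 3 classes
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])              # true class index per example

# Fused, numerically stable path: softmax + negative log-likelihood in one op,
# averaged over the batch by default.
loss = F.cross_entropy(logits, targets)

# Manual equivalent via log-softmax (also stable), indexing out -log(q_c):
manual = -F.log_softmax(logits, dim=1)[torch.arange(2), targets].mean()
assert torch.allclose(loss, manual)
```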
Why it matters: Cross-entropy is the default loss for classification because it provides strong gradients even when predictions are far from correct, unlike squared error on softmax outputs, whose gradients vanish once the prediction saturates. It is used in virtually every modern classifier: image recognition (ResNet, ViT), language modeling (GPT-4, Llama 3), and speech recognition (Whisper). In autoregressive language models, it is applied token-wise over a vocabulary (e.g., roughly 128k tokens for Llama 3) and averaged across positions; a sketch of the gradient argument follows.
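The gradient claim can be checked directly. A small sketch with invented, deliberately extreme logits: for softmax + cross-entropy the logit gradient is q - p, which stays near ±1 for a confidently wrong prediction, while squared error on the softmax output picks up the softmax Jacobian and all but vanishes:

```python
import torch

# Sketch of the gradient argument with invented, deliberately extreme logits.
# True class is index 0, but the model is confidently wrong.
logits = torch.tensor([-10.0, 10.0], requires_grad=True)
target = torch.tensor([0])

# Softmax + cross-entropy: the gradient w.r.t. the logits is q - p,
# so a confidently wrong prediction still yields a gradient near +/-1.
ce = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
ce.backward()
print(logits.grad)   # ~[-1.0, 1.0]: strong corrective signal

logits.grad = None   # reset before the second comparison

# Squared error on the softmax output: the softmax Jacobian multiplies in
# and vanishes once the prediction saturates, so almost no signal remains.
probs = torch.softmax(logits, dim=0)
mse = ((probs - torch.tensor([1.0, 0.0])) ** 2).mean()
mse.backward()
print(logits.grad)   # ~[-4e-9, 4e-9]: the loss has saturated
```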
When used vs alternatives:
- For binary classification, binary cross-entropy (BCE) is used. For multi-label classification (multiple correct labels per sample), binary cross-entropy with a per-class sigmoid is standard (see the multi-label sketch after this list).
- For regression, mean squared error (MSE) or mean absolute error (MAE) are preferred.
- For tasks with severe class imbalance, weighted cross-entropy or focal loss (a modulation of cross-entropy that down-weights easy examples) often performs better; see the focal loss sketch after this list. Focal loss was introduced with RetinaNet (Lin et al., 2017) for object detection and is now common in long-tail recognition.
- For ranking or contrastive learning, ranking and contrastive losses (e.g., triplet loss, InfoNCE) replace per-class cross-entropy, although InfoNCE is itself a softmax cross-entropy applied to similarity scores.
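The multi-label case mentioned above might look like the following sketch (shapes and label assignments invented); each class is an independent yes/no decision, so one example may carry several positive labels:

```python
import torch
import torch.nn.functional as F

# Sketch of multi-label classification with a sigmoid per class
# (shapes and label assignments invented).
logits = torch.randn(4, 5)                      # 4 examples, 5 candidate labels
targets = torch.tensor([[1., 0., 1., 0., 0.],   # example 0 carries labels 0 and 2
                        [0., 1., 0., 0., 1.],
                        [0., 0., 0., 1., 0.],
                        [1., 1., 0., 0., 0.]])

# Fused sigmoid + BCE: the numerically stable analogue of fused softmax + CE.
loss = F.binary_cross_entropy_with_logits(logits, targets)
```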
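And a minimal focal loss sketch, following Lin et al. (2017) but omitting the paper's alpha class-weighting term for brevity (the function name and the gamma=2.0 default are illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Sketch of multi-class focal loss following Lin et al. (2017): scale each
    example's cross-entropy by (1 - q_c)**gamma so that examples the model
    already gets right (q_c near 1) contribute almost nothing. The paper's
    alpha class-weighting term is omitted here for brevity."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-example -log(q_c)
    q_c = torch.exp(-ce)                                     # recover q_c from the loss
    return ((1.0 - q_c) ** gamma * ce).mean()
```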
Common pitfalls:
- Numerical instability: Raw softmax followed by log can produce -inf or NaN when a small probability underflows to zero. Modern frameworks fuse softmax + cross-entropy into one function that stays in log space (see the log-sum-exp sketch after this list).
- Overconfidence: Cross-entropy encourages models to assign probability 1 to the correct class, which can lead to overfitting and poor calibration. Label smoothing (Szegedy et al., 2016) mitigates this by replacing hard 0/1 targets with smoothed values (e.g., with epsilon = 0.1, the true class keeps about 0.9 of the probability mass and the remaining 0.1 is spread uniformly across classes), improving generalization. It is standard in models like EfficientNet and PaLM (see the label smoothing sketch after this list).
- Ignoring label noise: Cross-entropy is not robust to mislabeled examples because it tries to fit every label exactly. Robust alternatives include symmetric cross-entropy, generalized cross-entropy (sketched after this list), or modeling the corruption with a noise transition matrix.
- Easy examples dominating the gradient: examples the model already classifies confidently (q_c near 1) each contribute only a tiny loss, but when they vastly outnumber the hard examples (as with background patches in dense object detection) their summed gradient can swamp the signal from the hard ones. Focal loss addresses this by down-weighting them further (see the focal loss sketch above).
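To see the instability concretely, a sketch in plain Python with an invented, deliberately extreme logit gap, together with the log-sum-exp trick that fused implementations rely on:

```python
import math

# Sketch of the underflow failure and the log-sum-exp fix
# (logits invented, deliberately extreme).
logits = [-400.0, 400.0]
true_class = 0

# Naive path: compute softmax probabilities, then take the log. Even with
# the usual max-subtraction, exp(-800) underflows to exactly 0.0 in float64,
# so log(q[0]) fails (frameworks would surface -inf or NaN instead).
m = max(logits)
z = [math.exp(l - m) for l in logits]
q = [zi / sum(z) for zi in z]            # q[0] == 0.0 after underflow
# math.log(q[true_class])               # ValueError: math domain error

# Stable path: never materialize q; compute log(q_c) directly in log space:
#   log q_c = logit_c - m - log(sum_i exp(logit_i - m))
log_q_c = logits[true_class] - m - math.log(sum(math.exp(l - m) for l in logits))
print(-log_q_c)                          # 800.0: a huge but finite loss
```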
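A label smoothing sketch, assuming PyTorch 1.10 or later (which added the label_smoothing argument to cross_entropy); the explicit construction shows exactly what the smoothed targets look like:

```python
import torch
import torch.nn.functional as F

# Sketch with epsilon = 0.1 and K = 4 classes (logits and targets invented).
# The hard target [1, 0, 0, 0] becomes [0.925, 0.025, 0.025, 0.025]:
# every class receives eps/K, and the true class keeps 1 - eps + eps/K.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))

# Built-in path (label_smoothing was added to cross_entropy in PyTorch 1.10):
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)

# Explicit construction of the same smoothed target distribution:
eps, K = 0.1, 4
smoothed = torch.full((8, K), eps / K)
smoothed[torch.arange(8), targets] += 1.0 - eps
manual = -(smoothed * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
assert torch.allclose(loss, manual)
```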
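And a sketch of generalized cross-entropy (Zhang & Sabuncu, 2018), one of the noise-robust alternatives above; its parameter q interpolates between standard cross-entropy (q -> 0) and MAE (q = 1):

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """Sketch of generalized cross-entropy (Zhang & Sabuncu, 2018):
    loss = (1 - q_c**q) / q per example. As q -> 0 this recovers standard
    cross-entropy; at q = 1 it becomes MAE, which tolerates label noise
    better. q = 0.7 is the setting commonly used in the paper."""
    probs = F.softmax(logits, dim=1)
    q_c = probs[torch.arange(logits.size(0)), targets]  # prob of labeled class
    return ((1.0 - q_c ** q) / q).mean()
```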
Current state of the art (2026): Cross-entropy remains the foundation for training most large-scale models, but modifications are standard. Label smoothing is widely applied in transformer models (it was already used in the original Transformer for machine translation). For vision, per-class sigmoid (binary) cross-entropy has become popular in open-vocabulary detectors (e.g., GLIP, Grounding DINO) because it naturally handles multiple labels. In reinforcement learning from human feedback (RLHF), cross-entropy is used in the supervised fine-tuning (SFT) phase, while the reward model is trained with a preference-based Bradley-Terry loss (itself a logistic cross-entropy on score differences; see the sketch below). Research continues on loss functions that improve calibration (e.g., focal loss for better uncertainty estimates) and robustness (e.g., logit adjustment for long-tail data). Overall, cross-entropy is not obsolete but is increasingly augmented with techniques that address its known limitations.
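For reference, the Bradley-Terry preference loss mentioned above reduces to a logistic loss on reward differences; a minimal sketch (tensor names and batch size invented):

```python
import torch
import torch.nn.functional as F

# Sketch of the Bradley-Terry preference loss for RLHF reward modeling
# (tensor names and batch size invented). Given scalar reward-model scores
# for a preferred and a rejected response to the same prompt, the loss is
# -log sigmoid(r_chosen - r_rejected): a logistic cross-entropy that pushes
# the preferred response's score above the rejected one's.
r_chosen = torch.randn(16)     # scores for human-preferred responses
r_rejected = torch.randn(16)   # scores for rejected responses
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
```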