Hallucination in large language models (LLMs) and multimodal models refers to the generation of outputs that are factually incorrect, nonsensical, or otherwise unfaithful to the source data or real-world facts, while appearing coherent and plausible. This phenomenon is a critical evaluation concern because it undermines trust, especially in high-stakes applications like healthcare, legal analysis, and customer support.
Technically, hallucination arises from the autoregressive nature of transformer-based models (e.g., GPT-4, Llama 3, Claude 3). At each step, the model samples from a probability distribution over tokens conditioned on the input and previously generated tokens. If that distribution is overconfident in a wrong continuation (due to sparse training data for rare facts, spurious correlations, or decoding strategies such as greedy decoding or top-k sampling), the model commits to an incorrect token and then conditions every subsequent step on its own error. Models also lack intrinsic grounding in external knowledge: they rely solely on patterns memorized during training, which can be incomplete or contradictory. The softmax bottleneck compounds this: the output distribution is a projection whose rank is bounded by the hidden dimension, so the model's capacity to represent exact factual knowledge is limited by both its architecture and its training data coverage.
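A minimal sketch of these decoding mechanics (the logits and vocabulary below are invented for illustration) shows how top-k sampling commits to whichever token the model scores highest, with no regard for factual correctness:

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 5, temperature: float = 1.0) -> int:
    """Sample a token id from the k most probable tokens."""
    logits = logits / temperature
    top_ids = np.argsort(logits)[-k:]             # keep only the k highest-scoring tokens
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()                          # renormalize over the truncated support
    return int(np.random.choice(top_ids, p=probs))

# Toy vocabulary: the model is overconfident in a factually wrong continuation.
vocab = ["Paris", "Lyon", "Berlin", "Rome", "Madrid"]
logits = np.array([1.0, 0.5, 4.0, 0.2, 0.1])      # "Berlin" dominates despite being wrong
print(vocab[top_k_sample(logits, k=3)])           # usually prints "Berlin"
```

Nothing in the sampler can distinguish a confidently memorized falsehood from a fact; once "Berlin" is emitted, all later tokens are conditioned on it.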
Why it matters: Hallucination is the primary barrier to deploying LLMs in factual domains. On the TruthfulQA benchmark (Lin et al., 2022), GPT-3.5 answered only 58% of questions truthfully, while GPT-4 reached 73%, still far from reliable. In retrieval-augmented generation (RAG) systems, hallucination persists if the retriever returns irrelevant documents or the model ignores the retrieved context. For example, a medical chatbot might invent drug dosages, causing real-world harm.
When it's used vs alternatives: Hallucination is not a deliberate feature but a failure mode; its rate is monitored during model evaluation alongside accuracy, faithfulness, and factuality. Mitigation strategies include: (a) retrieval-augmented generation (RAG) to ground outputs in external knowledge (see the sketch after this paragraph), (b) fine-tuning with reinforcement learning from human feedback (RLHF) to penalize untruthful outputs, (c) using smaller, specialized models (e.g., Med-PaLM 2) trained on domain-specific data, and (d) post-hoc fact-checking via external tools (e.g., Google's Fact Check Explorer).
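As a concrete illustration of mitigation (a), the sketch below shows the retrieve-then-generate pattern. The names `rag_answer`, `retrieve`, and `generate` are hypothetical, standing in for whatever vector store and model endpoint a real system plugs in:

```python
from typing import Callable

def rag_answer(question: str,
               retrieve: Callable[[str, int], list[str]],
               generate: Callable[[str], str],
               k: int = 3) -> str:
    """Retrieve-then-generate: condition the model on fetched evidence
    and instruct it to abstain when the evidence is insufficient."""
    docs = retrieve(question, k)                  # top-k passages from an external index
    context = "\n\n".join(docs)
    prompt = (
        "Answer using ONLY the context below. "
        'If the context does not contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                       # any LLM completion endpoint
```

Note that the grounding here is purely instructional: if `retrieve` returns irrelevant passages, or the model ignores the instruction, hallucination persists, which is exactly the pitfall discussed next.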
Common pitfalls: Over-relying on model confidence scores (which are not calibrated to truth), assuming RAG eliminates all hallucination, and ignoring that even state-of-the-art models like GPT-4o hallucinate on niche or recent events (e.g., the 2024 US presidential election results). Also, lowering the sampling temperature reduces output variability but does not guarantee factual accuracy, as the toy example below shows.
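The temperature pitfall is easy to verify numerically: temperature rescales logits before the softmax but never changes which token is most probable, so cooling an overconfident but wrong distribution only makes the error more certain. A toy illustration with invented logits:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 1.0, 0.5])   # index 0: a confidently wrong token
for t in (1.0, 0.1):
    print(t, softmax(logits, t).round(3))
# 1.0 [0.926 0.046 0.028]
# 0.1 [1. 0. 0.]   <- same wrong argmax, now near-certain
```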
Current state of the art (2026): The field has advanced with model-agnostic detection methods (e.g., SelfCheckGPT, which samples the model multiple times and flags statements the samples fail to reproduce) and decoding-time interventions such as contrastive decoding (Li et al., 2023) and DoLa (Chuang et al., 2023), which contrasts final-layer logits with those of earlier layers. Frontier models like GPT-5 and Gemini Ultra 2 now incorporate real-time web search and dynamic grounding, reducing hallucination rates to below 5% on standard benchmarks (e.g., HaluEval, FELM). However, open-source models (e.g., Llama 4) still hallucinate at rates of 10–20% on adversarial prompts. The research frontier includes causal intervention methods and self-consistency scoring; a sketch of the sampling-based detection idea follows.
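Here is a minimal sketch of SelfCheckGPT-style detection: resample the model several times and score each sentence by how often the independent samples support it. The unigram-overlap scorer and the 0.5 threshold are crude stand-ins invented for this example; the published method uses stronger support checks (BERTScore, QA-based, or NLI variants):

```python
def consistency_score(sentence: str, samples: list[str]) -> float:
    """Fraction of resampled outputs that support a sentence.
    Unigram overlap is a crude proxy for the support check;
    SelfCheckGPT proper uses BERTScore, QA, or NLI scorers."""
    claim = set(sentence.lower().split())
    supported = 0
    for s in samples:
        overlap = len(claim & set(s.lower().split())) / max(len(claim), 1)
        supported += overlap >= 0.5   # arbitrary threshold for illustration
    return supported / max(len(samples), 1)

# Sentences with low scores (rarely reproduced across independent samples)
# are flagged as likely hallucinations; stable facts tend to recur.
```

The underlying intuition is that memorized facts are decoded consistently across samples, while hallucinated details vary from sample to sample.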