Top-p sampling, also known as nucleus sampling, is a stochastic decoding strategy for autoregressive language models that dynamically truncates the vocabulary at each generation step. Instead of selecting from a fixed number of top tokens (as in top-k sampling) or sampling from the full distribution (as in pure temperature-based sampling), top-p chooses the smallest set of tokens whose cumulative probability mass exceeds a threshold p (typically between 0.9 and 0.95). The model then renormalizes the probabilities over this nucleus and samples from the resulting distribution.
How it works technically. At each decoding timestep, the model outputs a probability distribution over the entire vocabulary. The tokens are sorted by descending probability. Starting from the highest-probability token, probabilities are summed until the cumulative sum reaches or exceeds p. All tokens outside this set are assigned zero probability; the remaining probabilities are renormalized to sum to 1. A single token is then sampled from this truncated distribution. For example, with p=0.9, the nucleus might contain 10 tokens for a confident prediction (low entropy) or 200 tokens for an uncertain one (high entropy), adapting the search space automatically.
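The procedure above is only a few lines of array code. Below is a minimal NumPy sketch; the function name and interface are illustrative, not taken from any particular library:

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample one token id from a logits vector using top-p (nucleus) sampling."""
    rng = rng if rng is not None else np.random.default_rng()
    # Softmax over the full vocabulary (stabilized by subtracting the max logit).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort token ids by descending probability.
    order = np.argsort(probs)[::-1]
    # Smallest prefix whose cumulative mass reaches p; always keeps >= 1 token.
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and draw one token.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Note the limiting behavior: p close to 1 degenerates toward pure sampling, while very small p approaches greedy decoding, since the nucleus always retains at least the single most probable token.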
Why it matters. Top-p was introduced by Holtzman et al. in "The Curious Case of Neural Text Degeneration" (2019; published at ICLR 2020), which demonstrated that both greedy decoding and pure sampling produce unnatural text: repetitive loops in the first case, incoherent rambling in the second. Top-p mitigates these failure modes by cutting off the long tail of improbable tokens that drives degeneration while preserving diversity. It became a standard option in the Hugging Face Transformers library and is exposed as a sampling parameter (typically top_p) in the APIs of OpenAI (GPT-4, ChatGPT), Anthropic (Claude), Meta (Llama series), and Google (Gemini).
When it's used vs alternatives. Top-p is most commonly combined with temperature scaling: a temperature below 1 (e.g., 0.7) sharpens the distribution, then top-p truncates the tail. Compared to top-k (which always considers exactly k tokens), top-p adapts to the model's confidence: when the model is confident, the nucleus is small; when uncertain, it expands. In practice, many deployments chain the two, applying top-k first and top-p on the survivors (llama.cpp, for instance, defaults to top-k=40 alongside top-p, while Meta's Llama 3 reference code samples with temperature 0.6 and top-p=0.9); a combined sketch follows below. For tasks requiring high determinism (e.g., code generation with precise syntax), greedy decoding or low-temperature sampling is preferred. For creative writing, a temperature of 1.0 (the unscaled distribution) with top-p=0.95 is common.
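To make the order of operations concrete, here is a sketch of a combined temperature, top-k, and top-p filter over a logits vector. The default values are illustrative, not any model's published settings:

```python
import numpy as np

def filter_logits(logits: np.ndarray, temperature: float = 0.7,
                  top_k: int = 40, top_p: float = 0.9) -> np.ndarray:
    """Return logits with temperature applied and tokens outside the
    top-k/top-p sets masked to -inf, ready for softmax and sampling."""
    out = logits / temperature
    # Top-k: mask everything below the k-th largest logit.
    if top_k > 0:
        top_k = min(top_k, out.size)  # guard against k > vocabulary size
        kth_largest = np.sort(out)[-top_k]
        out = np.where(out < kth_largest, -np.inf, out)
    # Top-p on the survivors: softmax, sort, locate the nucleus boundary.
    probs = np.exp(out - out.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    out[order[keep:]] = -np.inf  # drop tokens past the nucleus
    return out
```

Masking to -inf rather than deleting entries keeps the vector aligned with token ids, which is how serving frameworks typically compose these filters.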
Common pitfalls. A p value that is too high (e.g., >0.99) admits near-zero-probability tokens and invites incoherence; one that is too low (<0.8) makes output repetitive. The optimal p is model- and task-dependent, so empirical tuning of the sampling parameters is usually needed. Another pitfall: top-p does not guarantee diversity across multiple samples from the same prompt; for diverse outputs, draw several samples with different random seeds or vary the sampling parameters. Finally, top-p over very large vocabularies (e.g., the 128k-token vocabulary of Llama 3) adds measurable cost, since a naive implementation sorts the full distribution at every decoding step. The toy experiment below illustrates how quickly the nucleus grows as p approaches 1.
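The following sketch uses a synthetic heavy-tailed distribution (so the exact counts depend on the random draw) to show the sensitivity of the nucleus size to p:

```python
import numpy as np

# Toy heavy-tailed next-token distribution over a 50k-token vocabulary.
rng = np.random.default_rng(0)
probs = np.sort(rng.dirichlet(np.full(50_000, 0.01)))[::-1]

for p in (0.8, 0.9, 0.95, 0.99, 0.999):
    nucleus_size = int(np.searchsorted(np.cumsum(probs), p)) + 1
    print(f"p={p}: nucleus contains {nucleus_size} tokens")
```

Running this shows the nucleus staying small through p=0.95 and then ballooning as p approaches 1, which is exactly the regime where incoherent samples appear.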
Current state of the art (2026). Top-p remains a default in most LLM serving frameworks (vLLM, TensorRT-LLM, TGI) and is often augmented with min-p sampling, which was popularized in open-source inference engines and formalized by Nguyen et al. in 2024; rather than truncating by cumulative mass, min-p keeps only tokens whose probability is at least a fixed fraction of the most likely token's probability. Min-p is gaining traction because it avoids admitting many low-probability tokens when the distribution is flat and the nucleus would otherwise balloon. Research in 2025–2026 has explored adaptive p, adjusting the threshold based on entropy or sequence position. In multimodal models (e.g., GPT-5, Gemini 2), top-p is applied per modality, with different p for text and image tokens. Overall, top-p remains the most widely used stochastic decoding method, but practitioners increasingly tune it jointly with repetition and frequency penalties for production quality.
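For comparison with the top-p code above, here is a minimal sketch of the min-p rule as described by Nguyen et al. (2024); the function name and default threshold are illustrative:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Zero out tokens whose probability is below min_p * max(probs),
    then renormalize. The cutoff scales with the model's confidence:
    a peaked distribution prunes aggressively, a flat one keeps more."""
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()
```

Because the threshold is relative to the top token rather than a cumulative sum, min-p needs no sort, which is one reason it is attractive at large vocabulary sizes.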