Top-k sampling is a stochastic decoding strategy used during autoregressive text generation in large language models (LLMs). It addresses the tension between generating creative, varied outputs and maintaining grammatical coherence. The core idea is simple: at each generation step, instead of sampling from the full vocabulary (which may include many low-probability, nonsensical tokens), the model considers only the k tokens with the highest predicted probabilities. The probabilities of these k tokens are then renormalized (divided by their sum) to form a new probability distribution, and the next token is sampled from this truncated distribution.
How it works technically: Given a language model that outputs a probability distribution P(x_t | x_<t) over the vocabulary V for the next token x_t, top-k sampling first sorts all tokens by their probability in descending order. It selects the set V_top = {v_1, v_2, ..., v_k} of the k most probable tokens. The new sampling distribution is P'(x_t = v_i) = P(v_i) / Σ_{j=1}^{k} P(v_j) for v_i in V_top, and 0 otherwise. A random draw from this renormalized distribution yields the next token. The hyperparameter k controls the trade-off: a small k (e.g., 10) makes generation more deterministic and conservative, while a large k (e.g., 100) allows more diverse but potentially lower-quality tokens. Common values in practice range from 10 to 100, depending on model size and task.
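As a minimal sketch of this procedure in NumPy (the function name top_k_sample and the toy logits are illustrative, not from any particular library):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token id from the k highest-probability entries of `logits`."""
    # Convert logits to probabilities with a numerically stable softmax.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep only the k most probable tokens.
    top_ids = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    top_probs = probs[top_ids]

    # Renormalize over the truncated support and draw one token.
    top_probs /= top_probs.sum()
    return int(rng.choice(top_ids, p=top_probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])  # toy 5-token vocabulary
print(top_k_sample(logits, k=2, rng=rng))        # always returns 0 or 1
```

With k=2, the two lowest-probability tokens can never be drawn regardless of how many samples are taken, which is exactly the truncation behavior described above.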
Why it matters: Top-k sampling was introduced by Fan et al. (2018) for neural story generation as an improvement over pure greedy decoding (which always picks the most likely token, leading to repetitive and dull text) and pure ancestral sampling (which samples from the full distribution, often producing incoherent or irrelevant tokens). It became widely adopted after being popularized by the GPT-2 paper (Radford et al., 2019), which used k=40 for its generation samples. Top-k is the foundational technique in the broader family of truncation sampling methods, which includes nucleus (top-p) sampling (Holtzman et al., 2020): whereas top-k keeps a fixed number of candidate tokens, top-p dynamically adjusts the candidate set based on a cumulative probability threshold, adapting to the shape of the probability distribution and often yielding better results when the model's confidence varies from step to step.
When it's used vs alternatives: Top-k is commonly used in interactive applications like chatbots (e.g., character.ai, ChatGPT's earlier versions) and creative writing tools (e.g., Sudowrite, Jasper) where a balance of predictability and surprise is desired. It is often combined with temperature scaling: temperature τ > 1 flattens the distribution before top-k truncation, increasing diversity; τ < 1 sharpens it, reducing randomness. Alternatives include greedy decoding (for tasks requiring high accuracy like mathematical reasoning), beam search (for machine translation or summarization where multiple candidates are explored), and top-p sampling (which is now more common in production LLMs like GPT-4, Claude 3, and Llama 3). In practice, many frameworks (e.g., Hugging Face Transformers) allow combining top-k with top-p: the model first applies top-k truncation, then further filters by top-p, providing fine-grained control.
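As a concrete illustration of combining these knobs, the snippet below uses the Hugging Face Transformers generate API with the small public gpt2 checkpoint; the prompt and hyperparameter values are arbitrary examples, not recommended settings:

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,      # sample stochastically instead of greedy/beam search
    temperature=0.8,     # sharpen the distribution before truncation
    top_k=50,            # keep at most the 50 most probable tokens
    top_p=0.9,           # then keep the smallest set covering 90% of the mass
    max_new_tokens=20,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Setting top_k=0 (or omitting it) disables the top-k filter entirely, leaving temperature and top-p as the only modifications to the distribution.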
Common pitfalls: A fixed k can be suboptimal across different contexts because the shape of the next-token distribution varies from step to step. When the distribution is peaked (e.g., predicting "the" after "The cat sat on"), only a handful of tokens are plausible, yet a fixed k=50 still admits dozens of near-zero-probability tokens. Conversely, when the distribution is flat (e.g., the first word of an open-ended story), hundreds of continuations may be comparably likely, and k=50 cuts most of them off. This mismatch is why top-p often outperforms top-k in modern LLM deployments. Another pitfall is setting k too high for factual tasks (e.g., question answering), which can introduce hallucinations, or too low for creative tasks, leading to boring outputs.
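To make the mismatch concrete, here is a small NumPy illustration on a contrived 1,000-token vocabulary (the two distributions and the tokens_for_mass helper are toy constructions for demonstration): it counts how many top tokens are needed to cover 90% of the probability mass at a peaked step versus a flat step.

```python
import numpy as np

def tokens_for_mass(probs: np.ndarray, mass: float = 0.9) -> int:
    """Smallest number of top tokens whose cumulative probability reaches `mass`."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), mass) + 1)

vocab = 1000
# Peaked step: one dominant continuation, as after "The cat sat on".
peaked = np.full(vocab, 1e-4)
peaked[0] = 1.0
peaked /= peaked.sum()
# Flat step: hundreds of comparable continuations, as at the start of a story.
flat = np.full(vocab, 1.0 / vocab)

print(tokens_for_mass(peaked))  # 1    -> k=50 would keep 49 near-zero tokens
print(tokens_for_mass(flat))    # ~900 -> k=50 would discard most plausible tokens
```

A fixed k=50 is simultaneously too large for the first case and too small for the second, which is exactly the gap that top-p and adaptive-k methods target.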
Current state of the art (2026): Top-k sampling is now considered a legacy technique in many state-of-the-art LLM pipelines. Most frontier models (e.g., GPT-4o, Gemini 2.0, Claude 4, Llama 4) default to top-p sampling (typically p=0.9), whose candidate set adapts dynamically to the distribution, sometimes combined with temperature. However, top-k remains useful in specialized contexts: for example, in speculative decoding (Leviathan et al., 2023), where a draft model uses top-k to quickly generate candidate tokens, and in mixture-of-experts (MoE) models, where different experts may benefit from different k values. Research in 2024-2026 has explored adaptive top-k, where k is determined per step from entropy or confidence metrics (e.g., entropy-aware sampling), and contrastive decoding (Li et al., 2023), which penalizes tokens preferred by a weaker model. Despite these advances, top-k sampling remains a standard baseline in textbooks and introductory courses, and is supported in every major inference library (Hugging Face, vLLM, TensorRT-LLM, llama.cpp).
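As a sketch of the adaptive idea (one simple heuristic for illustration, not the method of any specific paper): set k per step to the distribution's perplexity, exp(H(p)), so that peaked distributions get a small k and flat ones a large k.

```python
import numpy as np

def adaptive_k(probs: np.ndarray, k_min: int = 1, k_max: int = 100) -> int:
    """Illustrative heuristic: set k to the distribution's perplexity exp(H(p)),
    clamped to [k_min, k_max]. Low entropy (peaked) -> small k; high entropy
    (flat) -> large k."""
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    return int(np.clip(np.round(np.exp(entropy)), k_min, k_max))

vocab = 1000
peaked = np.full(vocab, 1e-4)
peaked[0] = 1.0
peaked /= peaked.sum()
flat = np.full(vocab, 1.0 / vocab)

print(adaptive_k(peaked))  # small k (~3)
print(adaptive_k(flat))    # clamped to k_max=100 (true perplexity is 1000)
```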
In summary, top-k sampling is a simple yet effective technique that was instrumental in moving LLM generation from deterministic search toward controlled stochastic decoding, and while it has been largely superseded by top-p in production, it remains a fundamental concept in the AI/ML lexicon.