Sampling temperature is a hyperparameter used during the decoding phase of autoregressive language models (e.g., GPT-4, Llama 3, Claude) to modulate the probability distribution over the next token. Technically, it works by dividing the logits (the raw, unnormalized scores output by the model’s final linear layer) by the temperature value T before applying the softmax function: softmax(logits / T). When T = 1, the distribution is unchanged. When T < 1, the logits are scaled up, making the softmax output more peaked—high-probability tokens become even more likely, and low-probability tokens become vanishingly rare. At T → 0, the distribution approaches a one-hot vector (argmax), yielding deterministic, greedy decoding. When T > 1, the logits are scaled down, flattening the distribution so that tokens with lower original probability have a relatively higher chance of being selected. At very high T (e.g., 5.0), the distribution approaches uniform, producing nearly random output.
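As a concrete illustration, here is a minimal NumPy sketch of how temperature scaling is typically applied to a logit vector before sampling. The toy logits are invented, and the special-casing of T = 0 as argmax reflects a common inference-engine convention rather than the formula itself.

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Return next-token probabilities after temperature scaling."""
    if temperature == 0.0:
        # Greedy decoding: engines commonly special-case T=0 as argmax
        # instead of dividing by zero.
        probs = np.zeros_like(logits, dtype=float)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    # Subtract the max for numerical stability before exponentiating.
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

# Toy logits for a 5-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(t, np.round(apply_temperature(logits, t), 3))
```

Running the loop makes the behavior described above visible: at T = 0.5 the mass concentrates on the top token, at T = 1 the distribution matches the raw softmax, and at T = 5 it is close to uniform.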
Why it matters: Temperature is the primary lever for controlling the exploration-exploitation trade-off in text generation. Low temperature is preferred for factual, precise, or code-generation tasks where repeatability and accuracy are paramount (e.g., math problem solving, legal document drafting). High temperature is used for creative writing, brainstorming, or dialogue where novelty and variety are desired. The choice of temperature interacts strongly with other sampling strategies, such as top-k, top-p (nucleus sampling), and min-p, and is often tuned in tandem with them. Commonly cited production configurations illustrate the spread: roughly T=0.7 with top-p=0.9 for general chat with GPT-4-class models, T=0.6 as the default for Llama 3.1 405B instruction-tuned models, and T=0.2 or lower for code-generation models such as CodeLlama.
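To show how temperature and top-p are commonly combined, here is an illustrative sketch. The function name, default values, and ordering of steps are assumptions; real inference engines differ in where each filter sits in the pipeline.

```python
import numpy as np

def sample_with_temperature_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token id: temperature scaling, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # Temperature-scaled softmax (assumes temperature > 0; T=0 would be handled as argmax).
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))

# Toy usage with a 5-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_with_temperature_top_p(logits))
```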
Common pitfalls: (1) Using temperature alone without adjusting top-k or top-p can lead to incoherent outputs at high T, because probability mass spreads across the long tail of the vocabulary and low-probability tokens start getting sampled. (2) Believing temperature controls “creativity” directly; it only controls randomness, and true stylistic diversity also depends on training data, prompt, and model size. (3) Applying temperature during training: it is a decoding-time parameter only. (4) Setting T=0 for all use cases eliminates any chance of alternative correct answers, which can be harmful in open-ended tasks.
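A quick way to see pitfall (1) numerically is to count how many tokens are needed to cover 99% of the probability mass as T grows. The vocabulary size and synthetic logits below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a real next-token logit vector over a 32k vocabulary.
logits = rng.normal(scale=4.0, size=32_000)

for t in (0.5, 1.0, 1.5, 2.0):
    probs = np.exp((logits - logits.max()) / t)
    probs /= probs.sum()
    sorted_probs = np.sort(probs)[::-1]
    tokens_for_99 = int(np.searchsorted(np.cumsum(sorted_probs), 0.99)) + 1
    print(f"T={t}: {tokens_for_99} tokens cover 99% of the mass")
```

Without a top-k or top-p cutoff, the sampler reaches ever deeper into this tail as T rises, which is exactly where incoherent continuations come from.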
Current state of the art (2026): Temperature remains a universal decoding parameter, but recent research has introduced adaptive temperature scheduling, for example dynamic temperature based on token entropy (e.g., “entropy-based temperature scaling” in the 2025 paper *Adaptive Sampling for LLMs* by Li et al.) or per-head temperature in mixture-of-experts models. Contrastive decoding and typical sampling have also emerged as alternatives that sometimes outperform temperature-based sampling for factuality. In production, most LLM APIs (OpenAI, Anthropic, Google, Mistral) expose temperature as a user-adjustable parameter, with accepted values typically in the 0 to 2 range, though exact bounds vary by provider. The open-source community has also adopted “temperature” in inference engines like vLLM, TensorRT-LLM, and llama.cpp, where it is applied as one stage of the logit-processing and sampling pipeline (the exact ordering relative to top-k, top-p, and repetition penalties varies by engine).
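As a rough sketch of the adaptive-temperature idea (not the method of any particular paper), one simple formulation maps the normalized entropy of the raw distribution onto a temperature range, decoding confident steps nearly greedily and uncertain steps more freely. The bounds and mapping here are hypothetical.

```python
import numpy as np

def entropy_scaled_temperature(logits, t_min=0.3, t_max=1.2):
    """Illustrative adaptive temperature: map the normalized entropy of the
    raw (T=1) distribution onto [t_min, t_max]. Confident steps (low entropy)
    get a low temperature; uncertain steps get a higher one. The mapping and
    bounds are hypothetical, not taken from a specific paper."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    max_entropy = np.log(len(logits))    # entropy of a uniform distribution
    normalized = entropy / max_entropy   # in [0, 1]
    return t_min + (t_max - t_min) * normalized

# A peaked distribution yields a temperature near t_min; a flat one near t_max.
confident = [8.0, 1.0, 0.5, 0.2, -2.0]
uncertain = [1.1, 1.0, 0.9, 0.95, 1.05]
print(entropy_scaled_temperature(confident), entropy_scaled_temperature(uncertain))
```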