Sampling temperature is a hyperparameter used during the decoding phase of autoregressive language models (e.g., GPT-4, Llama 3, Claude) to modulate the probability distribution over the next token. Technically, it works by dividing the logits (the raw, unnormalized scores output by the model’s final linear layer) by the temperature value T before applying the softmax function: softmax(logits / T). When T = 1, the distribution is unchanged. When T < 1, the logits are scaled up, making the softmax output more peaked—high-probability tokens become even more likely, and low-probability tokens become vanishingly rare. At T → 0, the distribution approaches a one-hot vector (argmax), yielding deterministic, greedy decoding. When T > 1, the logits are scaled down, flattening the distribution so that tokens with lower original probability have a relatively higher chance of being selected. At very high T (e.g., 5.0), the distribution approaches uniform, producing nearly random output.
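As a concrete illustration, here is a minimal NumPy sketch of how temperature scaling is typically applied to a logit vector before sampling. The toy logits are invented, and the special-casing of T = 0 as argmax reflects a common inference-engine convention rather than the formula itself.

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Return next-token probabilities after temperature scaling."""
    if temperature == 0.0:
        # Greedy decoding: engines commonly special-case T=0 as argmax
        # instead of dividing by zero.
        probs = np.zeros_like(logits, dtype=float)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    # Subtract the max for numerical stability before exponentiating.
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

# Toy logits for a 5-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
for t in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(t, np.round(apply_temperature(logits, t), 3))
```

Running the loop makes the behavior described above visible: at T = 0.5 the mass concentrates on the top token, at T = 1 the distribution matches the raw softmax, and at T = 5 it is close to uniform.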
Why it matters: Temperature is the primary lever for controlling the exploration-exploitation trade-off in text generation. Low temperature is preferred for factual, precise, or code-generation tasks where repeatability and accuracy are paramount (e.g., math problem solving, legal document drafting). High temperature is used for creative writing, brainstorming, or dialogue where novelty and variety are desired. The choice of temperature interacts strongly with other sampling strategies, such as top-k, top-p (nucleus sampling), and min-p, and is often tuned in tandem with them. Commonly cited production configurations illustrate the spread: roughly T=0.7 with top-p=0.9 for general chat with GPT-4-class models, T=0.6 as the default for Llama 3.1 405B instruction-tuned models, and T=0.2 or lower for code-generation models such as CodeLlama.
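To show how temperature and top-p are commonly combined, here is an illustrative sketch. The function name, default values, and ordering of steps are assumptions; real inference engines differ in where each filter sits in the pipeline.

```python
import numpy as np

def sample_with_temperature_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token id: temperature scaling, then nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # Temperature-scaled softmax (assumes temperature > 0; T=0 would be handled as argmax).
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))

# Toy usage with a 5-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_with_temperature_top_p(logits))
```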
Common pitfalls: (1) Using temperature alone without adjusting top-k or top-p can lead to incoherent outputs at high T, because probability mass spreads across the long tail of the vocabulary and low-probability tokens start getting sampled. (2) Believing temperature controls “creativity” directly; it only controls randomness, and true stylistic diversity also depends on training data, prompt, and model size. (3) Applying temperature during training: it is a decoding-time parameter only. (4) Setting T=0 for all use cases eliminates any chance of alternative correct answers, which can be harmful in open-ended tasks.
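A quick way to see pitfall (1) numerically is to count how many tokens are needed to cover 99% of the probability mass as T grows. The vocabulary size and synthetic logits below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a real next-token logit vector over a 32k vocabulary.
logits = rng.normal(scale=4.0, size=32_000)

for t in (0.5, 1.0, 1.5, 2.0):
    probs = np.exp((logits - logits.max()) / t)
    probs /= probs.sum()
    sorted_probs = np.sort(probs)[::-1]
    tokens_for_99 = int(np.searchsorted(np.cumsum(sorted_probs), 0.99)) + 1
    print(f"T={t}: {tokens_for_99} tokens cover 99% of the mass")
```

Without a top-k or top-p cutoff, the sampler reaches ever deeper into this tail as T rises, which is exactly where incoherent continuations come from.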
Current state of the art (2026): Temperature remains a universal decoding parameter, but recent research has introduced adaptive temperature scheduling, for example dynamic temperature based on token entropy (e.g., “entropy-based temperature scaling” in the 2025 paper *Adaptive Sampling for LLMs* by Li et al.) or per-head temperature in mixture-of-experts models. Contrastive decoding and typical sampling have also emerged as alternatives that sometimes outperform temperature-based sampling for factuality. In production, most LLM APIs (OpenAI, Anthropic, Google, Mistral) expose temperature as a user-adjustable parameter, with accepted values typically in the 0 to 2 range, though exact bounds vary by provider. The open-source community has also adopted “temperature” in inference engines like vLLM, TensorRT-LLM, and llama.cpp, where it is applied as one stage of the logit-processing and sampling pipeline (the exact ordering relative to top-k, top-p, and repetition penalties varies by engine).
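As a rough sketch of the adaptive-temperature idea (not the method of any particular paper), one simple formulation maps the normalized entropy of the raw distribution onto a temperature range, decoding confident steps nearly greedily and uncertain steps more freely. The bounds and mapping here are hypothetical.

```python
import numpy as np

def entropy_scaled_temperature(logits, t_min=0.3, t_max=1.2):
    """Illustrative adaptive temperature: map the normalized entropy of the
    raw (T=1) distribution onto [t_min, t_max]. Confident steps (low entropy)
    get a low temperature; uncertain steps get a higher one. The mapping and
    bounds are hypothetical, not taken from a specific paper."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    max_entropy = np.log(len(logits))    # entropy of a uniform distribution
    normalized = entropy / max_entropy   # in [0, 1]
    return t_min + (t_max - t_min) * normalized

# A peaked distribution yields a temperature near t_min; a flat one near t_max.
confident = [8.0, 1.0, 0.5, 0.2, -2.0]
uncertain = [1.1, 1.0, 0.9, 0.95, 1.05]
print(entropy_scaled_temperature(confident), entropy_scaled_temperature(uncertain))
```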