
Greedy Decoding: definition + examples

Greedy decoding is the simplest and fastest autoregressive generation algorithm, used in encoder–decoder sequence-to-sequence models and in decoder-only large language models (LLMs). At each decoding timestep, the model outputs a probability distribution over the vocabulary, and greedy decoding selects the token with the highest probability (the argmax) as the next token. This process repeats until an end-of-sequence token is generated or a maximum length is reached. Because it makes locally optimal choices without any lookahead or stochasticity, greedy decoding is deterministic: given the same input and model weights, it always produces the same output.

How it works:

Formally, for a model with parameters θ, given a prefix sequence x_1, x_2, ..., x_t, the probability of the next token x_{t+1} is P(x_{t+1} | x_1...x_t; θ). Greedy decoding sets x_{t+1} = argmax_v P(v | x_1...x_t; θ). This is equivalent to beam search with beam width = 1. Generating a sequence of length T requires T sequential forward passes through the model; the single-pass trick with a causal mask applies to training and scoring, not to generation, although key-value caching lets each generation step process only the newly appended token.
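
To make the loop concrete, here is a minimal sketch in Python using GPT-2 through the Hugging Face Transformers library; the model choice, prompt, and 20-token budget are illustrative. For clarity the loop recomputes the full prefix at each step; production code would reuse the key-value cache.

```python
# Minimal greedy decoding loop, sketched with GPT-2 via Hugging Face
# Transformers; the model, prompt, and 20-token budget are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tok("The cat sat on", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits           # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()           # local argmax: the greedy choice
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:     # stop at end-of-sequence
            break

print(tok.decode(input_ids[0]))  # same input always yields the same output
```

This hand-rolled loop is equivalent to calling `model.generate(input_ids, do_sample=False, max_new_tokens=20)`, which is how greedy decoding is usually invoked in practice.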

Why it matters:

Greedy decoding is the baseline against which other decoding methods are measured. Its primary advantages are speed and reproducibility: it requires no sampling, no sorting, no top-k or top-p filtering, and no beam management. This makes it well suited to latency-critical applications, to machine translation where consistency is valued over creativity, and to code generation where deterministic outputs simplify debugging. However, greedy decoding often produces repetitive, generic, or dull text because it cannot recover from a suboptimal early choice. For example, in story generation it can loop phrases like “the cat sat on the mat, the cat sat on the mat, …” because each local argmax reinforces the previous pattern.

When it is used vs alternatives:

  • Greedy decoding is preferred when: (a) latency is critical, (b) reproducibility is required (e.g., automated testing, scientific experiments), (c) the task is highly constrained (e.g., short-answer QA, arithmetic reasoning), or (d) the model is small and prone to hallucination under sampling.
  • Alternatives include: (1) Beam search (width > 1), which maintains multiple hypotheses and keeps the highest-scoring complete sequence; it improves coherence but is slower and can still be repetitive. (2) Temperature sampling, which sharpens or flattens the probability distribution before sampling randomly; it introduces diversity but can produce incoherent outputs. (3) Top-k and top-p (nucleus) sampling, which truncate the distribution to a subset of likely tokens before sampling; these are the defaults in most modern LLM applications. (4) Contrastive decoding, which penalizes tokens that are too predictable under a smaller “amateur” model. (A sketch contrasting the main options follows this list.)
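
The sketch below decodes one prompt under greedy, beam-search, and truncated-sampling settings using Hugging Face Transformers' generate() method; GPT-2 and the specific parameter values are illustrative, not recommendations.

```python
# Comparing decoding strategies on one prompt; GPT-2 and all parameter
# values are illustrative choices, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tok("Once upon a time", return_tensors="pt").input_ids
pad = tok.eos_token_id  # GPT-2 has no pad token; reuse EOS

greedy = model.generate(input_ids, do_sample=False,
                        max_new_tokens=30, pad_token_id=pad)
beam   = model.generate(input_ids, num_beams=4, do_sample=False,
                        max_new_tokens=30, pad_token_id=pad)
sample = model.generate(input_ids, do_sample=True, temperature=0.8,
                        top_k=50, top_p=0.9,
                        max_new_tokens=30, pad_token_id=pad)

for name, out in [("greedy", greedy), ("beam", beam), ("sampled", sample)]:
    print(name, "→", tok.decode(out[0], skip_special_tokens=True))
```

Running the snippet repeatedly makes the trade-off visible: the greedy and beam outputs never change, while the sampled output differs on every run.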

Common pitfalls:

  • Greedy decoding tends to produce sequences with low perplexity (high average probability) but poor human-judged quality due to repetition and lack of diversity.
  • It is particularly bad for open-ended generation (storytelling, dialogue) where the model should explore less likely but creative continuations.
  • In machine translation, greedy decoding can lead to missing words or incorrect grammatical structures that beam search would correct by considering alternative prefixes.
  • Because greedy decoding is deterministic, its failure modes are repeatable: a small input perturbation that flips a single argmax can change the entire continuation, and the model’s biases surface identically on every run, with no stochastic variation to dilute them.

Current state of the art (2026):

Greedy decoding remains the default in many production systems for simple, constrained tasks. For speed, state-of-the-art serving stacks increasingly pair it with speculative decoding: a small draft model proposes several tokens cheaply, and the large model verifies them in a single parallel forward pass, accepting the longest prefix that matches its own choices. With greedy verification the output is identical to what the large model would have generated alone, only faster. Pure greedy decoding is now rarely used for creative generation; it persists where determinism is a hard requirement, such as regression testing of model outputs, automated evaluation, and pass@1 benchmarking of code and mathematical reasoning. Recent research has also explored deterministic alternatives such as contrastive decoding, which retains reproducibility (the argmax is taken over a contrastive objective rather than raw probabilities) while improving diversity over pure greedy.
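
For illustration, here is a minimal sketch of one greedy speculative-decoding step. The functions draft_logits and target_logits are hypothetical stand-ins for next-token logit calls on the draft and target models; a real implementation would batch the verification positions into a single forward pass of the target model.

```python
# Sketch of one greedy speculative-decoding step. draft_logits() and
# target_logits() are hypothetical stand-ins for next-token logit calls
# on the draft and target models; real systems batch the verification
# positions into one forward pass of the target model.
import numpy as np

def greedy_speculative_step(prefix, k, draft_logits, target_logits):
    # 1) Draft model proposes k tokens greedily.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = int(np.argmax(draft_logits(ctx)))
        proposed.append(t)
        ctx.append(t)
    # 2) Target model checks each proposal against its own argmax.
    accepted = []
    for t in proposed:
        target_choice = int(np.argmax(target_logits(list(prefix) + accepted)))
        if target_choice != t:
            accepted.append(target_choice)  # disagreement: take the target's token, stop
            break
        accepted.append(t)                  # agreement: draft token kept "for free"
    return accepted
```

Because every accepted token equals the target model's own argmax, the final sequence matches pure greedy decoding of the target model exactly; the draft model only changes how many tokens are produced per expensive forward pass.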

Examples

  • Setting temperature=0 in OpenAI's completions API (e.g., with GPT-3.5 / text-davinci-003) effectively selects greedy decoding, producing near-deterministic completions.
  • Google's T5 paper (Raffel et al., 2020) used greedy decoding for most tasks, switching to beam search only for tasks with long outputs, such as translation and summarization.
  • Hugging Face Transformers library sets `do_sample=False` by default, which triggers greedy decoding in its `generate()` method.
  • Inference servers such as vLLM treat temperature=0 requests as greedy decoding, selecting the argmax token at every step.
  • Meta's Llama 2 report (2023) evaluated HumanEval code generation at temperature=0 (greedy decoding) for its pass@1 scores, the usual convention for single-attempt benchmarks.

Related terms

Beam Search · Top-k Sampling · Top-p (Nucleus) Sampling · Temperature Scaling · Speculative Decoding

FAQ

What is Greedy Decoding?

Greedy decoding is a deterministic text generation strategy that selects the token with the highest predicted probability at each step, without considering future consequences or exploring alternatives.

How does Greedy Decoding work?

At each timestep the model produces a probability distribution over its vocabulary; greedy decoding takes the argmax as the next token, appends it to the context, and repeats until an end-of-sequence token or a length limit is reached. See “How it works” above for the formal definition.

Where is Greedy Decoding used in 2026?

Mostly where determinism matters: temperature=0 API calls, pass@1 code and reasoning benchmarks, regression testing of model outputs, and the default behavior of Hugging Face Transformers' `generate()` method (`do_sample=False`). Creative and open-ended generation instead relies on sampling methods, often accelerated with speculative decoding. See the Examples section above for specifics.