gentic.news — AI News Intelligence Platform

Beam Search: definition + examples

Beam search is a decoding algorithm commonly used in sequence-to-sequence models, such as transformers for machine translation, text summarization, and image captioning. Unlike greedy decoding, which selects only the single most probable token at each time step, beam search maintains a fixed number of candidate sequences, called the beam width (often denoted as B or k). At each step, the model scores all possible extensions of the current candidates, then retains only the top-k sequences by cumulative probability. This process continues until a termination condition (e.g., end-of-sequence token or maximum length) is met, at which point the highest-scoring sequence is selected.
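To make the contrast with greedy decoding concrete, here is a toy two-step example (vocabulary and probabilities invented for illustration) where greedy commits to the locally best first token and misses the globally most probable sequence, while a width-2 beam recovers it:

```python
# Toy contrast between greedy decoding and beam search. The vocabulary is
# {"A", "B"} and all probabilities are invented for illustration.
step1 = {"A": 0.6, "B": 0.4}
step2 = {
    "A": {"A": 0.5, "B": 0.5},  # distribution over the second token after "A"
    "B": {"A": 0.9, "B": 0.1},  # distribution over the second token after "B"
}

# Greedy: take the argmax at each step -> "AA" with probability 0.6 * 0.5 = 0.30.
t1 = max(step1, key=step1.get)
t2 = max(step2[t1], key=step2[t1].get)
greedy_seq, greedy_p = t1 + t2, step1[t1] * step2[t1][t2]

# Beam width 2: keep both first tokens, score all four two-token sequences,
# and pick the best -> "BA" with probability 0.4 * 0.9 = 0.36.
candidates = [
    (a + b, step1[a] * step2[a][b]) for a in step1 for b in step2[a]
]
beam_seq, beam_p = max(candidates, key=lambda c: c[1])
```

Greedy's first choice ("A", probability 0.6) looks better locally, but the sequence starting with "B" ends up more probable overall; only the beam sees it.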

Technically, beam search operates on a probability distribution over the vocabulary at each decoding step. For a beam width of k, the algorithm initializes with the start token; after the first expansion the beam fills to k hypotheses. At step t, for each of the k candidates, the model computes log-probabilities for all vocabulary tokens, yielding k * |V| candidate extensions. These are sorted by cumulative log-probability, and the top k are kept. This k * |V| expansion at every step is the primary computational bottleneck; in practice, beam widths are typically set between 4 and 10 for most tasks, though larger beams (e.g., 50) have been used in some high-stakes settings, such as neural machine translation systems built for WMT competitions.
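These mechanics can be sketched in pure Python. In the sketch below, the `lm` table and its probabilities are invented stand-ins for a model's per-step next-token distribution, and `<eos>` marks end of sequence:

```python
import math

# Minimal beam search sketch. "lm" maps the last token of a hypothesis to an
# (invented) next-token distribution; a real decoder would condition on the
# whole prefix.
lm = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.6, "dog": 0.4},
    "a":   {"cat": 0.3, "dog": 0.7},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def beam_search(lm, k=2, max_len=5):
    beams = [(0.0, ["<s>"])]   # (cumulative log-prob, token sequence)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every vocabulary token: k * |V| candidates.
        expansions = [
            (logp + math.log(p), seq + [tok])
            for logp, seq in beams
            for tok, p in lm[seq[-1]].items()
        ]
        # Prune to the top k by cumulative log-probability.
        expansions.sort(key=lambda h: h[0], reverse=True)
        beams = []
        for logp, seq in expansions[:k]:
            (finished if seq[-1] == "<eos>" else beams).append((logp, seq))
        if not beams:  # every surviving hypothesis has terminated
            break
    return max(finished + beams, key=lambda h: h[0])

score, seq = beam_search(lm)
```

With k=2 this keeps "a dog" (probability 0.35) over "the cat" (0.30) after the second step, even though "the" was the more probable first token.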

Beam search is primarily an inference-time algorithm, but it enters training in two forms: (1) beam search optimization, where the model is trained so that its outputs score well under beam search (e.g., via minimum Bayes risk or reinforcement learning), and (2) using beam-searched outputs in place of ground-truth prefixes in teacher-forcing or scheduled-sampling regimes. In 2026, state-of-the-art large language models (LLMs) like GPT-5 and Gemini 2.0 often use beam search for controlled generation tasks (e.g., code generation, mathematical reasoning) but have largely shifted to sampling-based methods (top-p, top-k, temperature) for open-ended text generation, where diversity matters more. For sequence-level tasks like translation, beam search remains dominant; the 2023 WMT evaluation showed that beam search with widths 4-6 outperformed greedy decoding by 1-2 BLEU points on average.

Common pitfalls include: (1) length bias – longer sequences accumulate more log-probability terms and therefore score lower, so beam search tends to produce shorter outputs; this is mitigated by length normalization (dividing the cumulative log-probability by sequence length, or applying a length penalty). (2) repetition – beam search can get stuck in loops; solutions include trigram blocking or a diversity penalty. (3) diminishing returns on beam width – increasing the width beyond 10-20 often yields negligible gains while cost grows linearly with the width. (4) lack of diversity – all candidates may converge to near-identical outputs; diverse beam search (Vijayakumar et al., 2018) addresses this by encouraging dissimilarity among beams.
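Two of these mitigations are easy to sketch. The alpha value and hypothesis scores below are illustrative; the smoothed penalty follows the ((5 + len) / 6)^alpha form popularized by GNMT (Wu et al., 2016):

```python
# Length normalization: divide the cumulative log-probability by a function
# of length so longer hypotheses are not penalized simply for being longer.
def normalized_score(logprob, length, alpha=0.6):
    return logprob / (length ** alpha)

def gnmt_score(logprob, length, alpha=0.6):
    # Smoothed GNMT-style length penalty: ((5 + len) / 6) ** alpha.
    return logprob / (((5 + length) / 6) ** alpha)

# Illustrative hypotheses as (cumulative log-prob, length): the short one has
# the higher raw score, but normalization prefers the longer one.
short, long_ = (-4.0, 4), (-6.0, 10)
raw_winner = max([short, long_], key=lambda h: h[0])                   # short
norm_winner = max([short, long_], key=lambda h: normalized_score(*h))  # long_

# Trigram blocking: reject a next token that would recreate an n-gram
# already present in the sequence.
def repeats_ngram(seq, next_tok, n=3):
    cand = seq + [next_tok]
    if len(cand) < n:
        return False
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return tuple(cand[-n:]) in seen
```

For example, `repeats_ngram(["a", "b", "c", "a", "b"], "c")` is True, since appending "c" would repeat the trigram ("a", "b", "c").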

Current state of the art (2026) includes learned beam search (e.g., using reinforcement learning to optimize beam width per instance), beam search with prefix constraints (used in tools like GitHub Copilot), and hybrid approaches that combine beam search with speculative decoding for faster inference. In training, techniques like beam search distillation (where a student model learns from a teacher's beam search outputs) are used to compress large models into smaller ones for deployment.

Examples

  • Google's GNMT (Neural Machine Translation) used beam search with width 4 to achieve state-of-the-art translation quality in 2016.
  • OpenAI's GPT-4 uses beam search with length penalty for code generation tasks in Codex and GitHub Copilot to produce syntactically correct completions.
  • Hugging Face's T5 model for summarization (e.g., 't5-large') defaults to beam search with width 4 and a repetition penalty of 2.5.
  • Facebook's M2M-100 multilingual translation model uses beam search with width 5 during inference, as reported in their 2020 paper.
  • DeepMind's AlphaFold 2 uses beam search over protein structure space (with beam width 3) to generate multiple candidate conformations during training.

Related terms

Greedy Decoding · Top-k Sampling · Top-p (Nucleus) Sampling · Sequence-to-Sequence Model · Teacher Forcing

FAQ

What is Beam Search?

Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set (the beam width). In AI/ML training, it is used during sequence generation to keep the top-k candidate sequences at each step, balancing output quality against computational cost.
