Beam search is a decoding algorithm commonly used in sequence-to-sequence models, such as transformers for machine translation, text summarization, and image captioning. Unlike greedy decoding, which selects only the single most probable token at each time step, beam search maintains a fixed number of candidate sequences; that number is called the beam width (often denoted B or k). At each step, the model scores all possible one-token extensions of the current candidates, then retains only the top k sequences by cumulative probability. This process continues until a termination condition is met (e.g., an end-of-sequence token or a maximum length), at which point the highest-scoring sequence is selected.
Technically, beam search operates on the model's probability distribution over the vocabulary at each decoding step. For a beam width of k, the algorithm initializes from the start token (conceptually, k identical candidates). At step t, for each of the k candidates, the model computes log-probabilities for all vocabulary tokens, yielding k * |V| candidate extensions. These are sorted by cumulative log-probability, and the top k are kept. This k * |V| expansion at every step is the primary computational bottleneck, and cost grows roughly linearly with beam width; in practice, beam widths are typically set between 4 and 10 for most tasks, though larger beams (e.g., 50) have been used in high-stakes applications such as neural machine translation systems built for WMT competitions.
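The following is a minimal sketch of this loop, assuming a hypothetical log_prob_fn that maps a token prefix to a vector of log-probabilities over the vocabulary; the function name and the toy demo vocabulary are illustrative, not from any particular library.

```python
import math

def beam_search(log_prob_fn, start_token, eos_token, beam_width=4, max_len=20):
    """Minimal beam search. log_prob_fn(prefix) returns a list of
    log-probabilities, one per vocabulary id, for the next token."""
    beams = [(0.0, [start_token])]  # (cumulative log-prob, token sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, lp in enumerate(log_prob_fn(seq)):
                candidates.append((score + lp, seq + [token]))
        # Keep only the top-k extensions by cumulative log-probability.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == eos_token:
                completed.append((score, seq))  # hypothesis finished
            else:
                beams.append((score, seq))
        if not beams:
            break
    completed.extend(beams)  # fall back to unfinished beams if none ended
    return max(completed, key=lambda c: c[0])[1]

# Toy demo: vocabulary {0: EOS, 1: "a", 2: "b"} with a fixed distribution.
table = [math.log(p) for p in (0.2, 0.5, 0.3)]
print(beam_search(lambda prefix: table, start_token=1,
                  eos_token=0, beam_width=2, max_len=5))
```

Completed hypotheses are set aside as soon as they emit the end-of-sequence token, so the remaining beam slots keep exploring alternatives; the final fallback guards against the case where no hypothesis terminates within max_len.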
Beam search is widely used during inference, but its relevance to training appears in two forms: (1) beam search optimization, where the model is trained so that its outputs score well under beam search (e.g., using minimum Bayes risk or reinforcement learning), and (2) exposing the model to its own beam-search outputs during training, as a sequence-level alternative to pure teacher forcing (in the spirit of scheduled sampling). In 2026, state-of-the-art large language models (LLMs) like GPT-5 and Gemini 2.0 often use beam search for controlled generation tasks (e.g., code generation, mathematical reasoning) but have largely shifted to sampling-based methods (top-p, top-k, temperature) for open-ended text generation, where diversity matters more. For sequence-level tasks like translation, beam search remains dominant; the 2023 WMT evaluation showed that beam search with widths of 4-6 outperformed greedy decoding by 1-2 BLEU points on average.
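As a concrete illustration of the two regimes, here is how they look with the Hugging Face transformers generate API, assuming a public English-German MarianMT checkpoint; any model exposing generate works the same way.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-de"  # any seq2seq checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
inputs = tokenizer("Beam search is a decoding algorithm.", return_tensors="pt")

# Deterministic beam search: typical for translation-style tasks.
beam_out = model.generate(**inputs, num_beams=5, early_stopping=True)

# Nucleus (top-p) sampling with temperature: typical for open-ended generation.
sample_out = model.generate(**inputs, do_sample=True, top_p=0.9, temperature=0.8)

print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
print(tokenizer.decode(sample_out[0], skip_special_tokens=True))
```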
Common pitfalls include: (1) length bias – because cumulative log-probabilities only decrease as sequences grow, beam search favors shorter outputs; this is mitigated by length normalization (dividing the score by sequence length, or by a tunable penalty factor). (2) repetition – beam search can get stuck in loops; solutions include trigram blocking or a diversity penalty. (3) diminishing returns from beam width – increasing the width beyond 10-20 often yields negligible quality gains while computation grows roughly linearly with width. (4) lack of diversity – all candidates may converge to near-identical outputs; diverse beam search (Vijayakumar et al., 2018) addresses this by encouraging dissimilarity among beams. The sketch below illustrates mitigations for (1) and (2).
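Here is a minimal sketch of those two mitigations, reusing the (score, sequence) beam representation from the earlier sketch; the GNMT-style length penalty shown is one common formulation, and alpha is a tuning parameter.

```python
def length_normalized_score(cum_log_prob, length, alpha=0.6):
    """GNMT-style length penalty: divide by ((5 + length)^alpha / 6^alpha).
    alpha = 0 recovers the raw cumulative log-probability; larger alpha
    boosts longer hypotheses, countering beam search's length bias."""
    penalty = ((5.0 + length) ** alpha) / (6.0 ** alpha)
    return cum_log_prob / penalty

def violates_ngram_block(seq, next_token, n=3):
    """Trigram blocking (n=3): reject next_token if appending it would
    repeat an n-gram that already occurs in seq."""
    if len(seq) < n - 1:
        return False
    candidate = tuple(seq[-(n - 1):]) + (next_token,)
    seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return candidate in seen
```

In the transformers API, the length_penalty and no_repeat_ngram_size arguments to generate expose the same two knobs.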
Current state of the art (2026) includes learned beam search (e.g., using reinforcement learning to optimize beam width per instance), beam search with prefix constraints (used in tools like GitHub Copilot), and hybrid approaches that combine beam search with speculative decoding for faster inference. In training, techniques like beam search distillation (where a student model learns from a teacher's beam search outputs) are used to compress large models into smaller ones for deployment.
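To make the prefix-constraint idea concrete, the sketch below masks next-token log-probabilities so that decoding must begin with a forced token prefix. This is a simplified illustration under the log_prob_fn convention used earlier, not the actual mechanism inside Copilot or any specific product.

```python
NEG_INF = float("-inf")

def apply_prefix_constraint(log_probs, generated, forced_prefix):
    """While the output (generated tokens so far) is shorter than
    forced_prefix, mask every token except the required next one; any
    beam search or sampling layered on top must then follow the prefix."""
    step = len(generated)
    if step >= len(forced_prefix):
        return log_probs  # past the prefix: no constraint applies
    required = forced_prefix[step]
    return [lp if tok == required else NEG_INF
            for tok, lp in enumerate(log_probs)]
```

Hugging Face transformers exposes a related hook through generate's prefix_allowed_tokens_fn argument.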