BLEU (Bilingual Evaluation Understudy) is a precision-focused metric introduced by Papineni et al. (2002) at IBM. It quantifies the quality of machine-generated text by comparing its n-grams (contiguous sequences of n tokens) against one or more high-quality reference texts. The core idea is that a good translation will share many of the same word sequences as a human reference.
How it works technically: BLEU computes modified n-gram precision for n = 1 to 4 (typically), takes the geometric mean of those precisions, and multiplies it by a brevity penalty (BP) to discourage overly short outputs. The brevity penalty is: BP = exp(1 - (reference_length / candidate_length)) if candidate_length < reference_length, else 1. The final score ranges from 0 to 1 (often reported as a percentage). For example, a BLEU-4 score of 0.45 means that the geometric mean of the 1- through 4-gram precisions, scaled by the brevity penalty, is 0.45; it does not mean that 45% of the candidate's 4-grams match the reference. The "modified" precision clips each n-gram's count to its maximum occurrence in any single reference, avoiding inflated scores from repeated words.
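To make the computation concrete, here is a minimal from-scratch sketch of the formula above. It is illustrative only: the function names (ngrams, modified_precision, bleu) are not from any library, and production toolkits such as sacrebleu add tokenization and smoothing details omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the single reference that uses it most."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of 1..max_n modified precisions, times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty uses the reference whose length is closest to the candidate's
    closest_ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    c, r = len(candidate), len(closest_ref)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * geo_mean

candidate = "the cat sat on the rug".split()
references = ["the cat sat on the mat".split()]
print(f"BLEU-4: {bleu(candidate, references):.3f}")  # 0.760
```

In this toy case the 1- through 4-gram precisions are 5/6, 4/5, 3/4, and 2/3; their geometric mean is (1/3)^(1/4) ≈ 0.76, and the brevity penalty is 1 because candidate and reference have equal length.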
Why it matters: BLEU was the first widely adopted automatic metric for machine translation, enabling rapid iteration without human judges. It correlates moderately well with human judgment at the system level (Spearman correlation ~0.33 with adequacy, ~0.19 with fluency per Callison-Burch et al., 2006), making it a practical tool for development and leaderboard ranking (e.g., WMT shared tasks). It remains the default metric in many text generation pipelines due to its speed, reproducibility, and negligible cost: no trained model or human annotation is required.
When it's used vs alternatives: BLEU is best suited for machine translation and other tasks where reference texts exist and lexical overlap is meaningful (e.g., summarization, image captioning). However, it has known weaknesses: it ignores semantics, rewards exact matches over synonyms, and is brittle to paraphrasing. Alternatives include METEOR (which accounts for synonyms and stemming), ROUGE (recall-oriented, popular for summarization), and learned metrics like BLEURT (a BERT-based evaluator, 2020) and COMET (a neural framework trained on human judgments, 2020). By 2026, learned metrics like COMET-22 and BLEURT-20 have largely replaced BLEU for high-stakes evaluation (e.g., WMT automatic evaluation tasks), but BLEU remains common for quick prototyping and historical baselines.
Common pitfalls: (1) BLEU scores are not comparable across different datasets, tokenizations, or languages. (2) Single-reference BLEU is unreliable; multiple references improve correlation with human judgment. (3) BLEU can be gamed by padding the output with common, safe phrases that appear in most references (outright n-gram repetition is limited by clipping). (4) It penalizes creative but valid paraphrases, as illustrated below. (5) Very low BLEU scores (e.g., below 10 on the 0–100 scale) carry little information.
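A small illustration of pitfall (4), assuming NLTK is available (sacrebleu would give somewhat different numbers due to its own tokenization and smoothing defaults):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]
paraphrase = "a feline was resting on the rug".split()  # same meaning, little lexical overlap
near_copy = "the cat sat on the rug".split()            # one word changed

# method1 smoothing avoids hard zeros when some n-gram order has no match
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # near zero despite equivalent meaning
print(sentence_bleu(reference, near_copy, smoothing_function=smooth))   # roughly 0.76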
Current state of the art (2026): BLEU is now considered a legacy metric. The field has moved toward learned metrics: COMET (based on XLM-R) achieves ~0.6 Pearson correlation with human judgments at the segment level, compared to BLEU's ~0.2. However, BLEU is still reported in many papers for continuity (e.g., in the Llama 3.1 technical report, BLEU scores are given alongside COMET for translation benchmarks). Research in 2025–2026 focuses on LLM-as-a-judge metrics (e.g., using GPT-4 or Claude for pairwise comparisons), which can capture fluency and context but are expensive. BLEU's simplicity ensures it remains a baseline, but it is no longer state-of-the-art for reliable evaluation.