gentic.news — AI News Intelligence Platform

BLEU: definition + examples

BLEU (Bilingual Evaluation Understudy) is a precision-focused metric introduced by Papineni et al. (2002) at IBM. It quantifies the quality of machine-generated text by comparing its n-grams (contiguous sequences of n tokens) against one or more high-quality reference texts. The core idea is that a good translation will share many of the same word sequences as a human reference.

How it works technically: BLEU computes the modified (clipped) precision of n-grams for n=1 to n=4 (typically), takes the geometric mean of those precisions, and multiplies by a brevity penalty (BP) to discourage overly short outputs. The brevity penalty is: BP = exp(1 - (reference_length / candidate_length)) if candidate_length < reference_length, else 1. The final score ranges from 0 to 1 (often reported as a percentage). A BLEU-4 score of 0.45 therefore means that the length-penalized geometric mean of the 1- through 4-gram precisions is 0.45; it does not mean that 45% of the 4-grams matched. The "modified" precision clips each n-gram's count to its maximum occurrence in any single reference, avoiding inflated scores from repeated words.
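As a concrete (unofficial) sketch of the steps above, the standalone Python function below implements clipped n-gram precision, the geometric mean, and the brevity penalty; real toolkits such as sacreBLEU add standardized tokenization and smoothing on top of this.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: modified n-gram precision for
    n = 1..max_n, geometric mean, times the brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram count to its max occurrence in any single reference.
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any order has no matches
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty uses the reference length closest to the candidate's.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * geo_mean
```

An exact match scores 1.0, and any n-gram order with zero matches drives the whole score to zero, which is why sentence-level BLEU is usually smoothed in practice.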

Why it matters: BLEU was the first widely adopted automatic metric for machine translation, enabling rapid iteration without human judges. It correlates moderately well with human judgment at the system level (Spearman correlation ~0.33 with adequacy, ~0.19 with fluency per Callison-Burch et al., 2006), making it a practical tool for development and leaderboard ranking (e.g., WMT shared tasks). It remains the default metric in many text generation pipelines due to its speed, reproducibility, and zero cost.

When it's used vs alternatives: BLEU is best suited for machine translation and other tasks where reference texts exist and lexical overlap is meaningful (e.g., summarization, image captioning). However, it has known weaknesses: it ignores semantics, rewards exact matches over synonyms, and is brittle to paraphrasing. Alternatives include METEOR (which accounts for synonyms and stemming), ROUGE (recall-oriented, popular for summarization), and learned metrics like BLEURT (a BERT-based evaluator, 2020) and COMET (a neural framework trained on human judgments, 2020). By 2026, learned metrics like COMET-22 and BLEURT-20 have largely replaced BLEU for high-stakes evaluation (e.g., WMT automatic evaluation tasks), but BLEU remains common for quick prototyping and historical baselines.

Common pitfalls:

  • BLEU scores are not comparable across different datasets or languages.
  • Single-reference BLEU is unreliable; multiple references improve correlation.
  • BLEU can be gamed by using common phrases or n-gram repetition.
  • It penalizes creative but valid paraphrases.
  • Very low BLEU scores (e.g., <10) are often meaningless.
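The gaming pitfall is exactly what the clipping described earlier mitigates. A minimal stdlib sketch of modified unigram precision shows why a degenerate, repetitive candidate still scores poorly:

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Modified (clipped) unigram precision: each candidate word counts
    at most as often as it appears in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

ref = "the cat is on the mat"
# Naive precision would give this degenerate output 7/7 = 1.0;
# clipping caps "the" at its reference count of 2.
print(clipped_unigram_precision("the the the the the the the", ref))  # → 0.2857… (2/7)
```

Clipping blunts repetition, but it does nothing for the paraphrase and semantics pitfalls, which is what motivates METEOR and the learned metrics discussed below.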

Current state of the art (2026): BLEU is now considered a legacy metric. The field has moved toward learned metrics: COMET (based on XLM-R) achieves ~0.6 Pearson correlation with human judgments at the segment level, compared to BLEU's ~0.2. However, BLEU is still reported in many papers for continuity (e.g., in the Llama 3.1 technical report, BLEU scores are given alongside COMET for translation benchmarks). Research in 2025–2026 focuses on LLM-as-a-judge metrics (e.g., using GPT-4 or Claude for pairwise comparisons), which can capture fluency and context but are expensive. BLEU's simplicity ensures it remains a baseline, but it is no longer state-of-the-art for reliable evaluation.

Examples

  • Google's GNMT (2016) reported BLEU scores of 38.95 on WMT'14 English-French and 24.6 on WMT'14 English-German.
  • The WMT 2023 shared task used BLEU as a secondary metric, with best systems achieving ~35 BLEU on English-German.
  • OpenAI's GPT-4 technical report (2023) reported BLEU scores on machine translation benchmarks alongside human evaluation.
  • Facebook's M2M-100 (2020) achieved BLEU scores ranging from 15.2 (English to Swahili) to 38.5 (English to French).
  • In the Llama 3.1 (2024) paper, BLEU was reported on the WMT'14 English-German test set (score ~30) as a baseline, with COMET scores preferred for final analysis.

Related terms

ROUGE · METEOR · COMET · Perplexity · Human Evaluation

FAQ

What is BLEU?

BLEU (Bilingual Evaluation Understudy) is an automatic metric that measures the overlap of n-grams between machine-generated text and one or more reference translations, commonly used for machine translation and text generation evaluation.

How does BLEU work?

BLEU computes the clipped precision of n-grams (typically n=1 to n=4) between the candidate text and one or more references, takes the geometric mean of those precisions, and multiplies by a brevity penalty that lowers the score of candidates shorter than the reference. The result is a score between 0 and 1, often reported as a percentage.

Where is BLEU used in 2026?

By 2026, BLEU is mainly used for quick prototyping and as a historical baseline. Many papers (e.g., the Llama 3.1 technical report) still report BLEU alongside learned metrics for continuity, while high-stakes evaluations such as the WMT automatic evaluation tasks rely on learned metrics like COMET-22 and BLEURT-20.