gentic.news — AI News Intelligence Platform

Elo Rating: definition + examples

Elo Rating, originally developed by Arpad Elo for chess, is a statistical method for calculating the relative skill levels of players in zero-sum games. In AI/ML, it has been repurposed as an evaluation framework for large language models (LLMs) and other generative systems. The core idea is to treat each model as a "player" and each comparison (e.g., which output is better according to a human or automated judge) as a "match." After each match, the ratings of the two models are updated: the winner gains points from the loser, with the amount determined by the expected outcome based on prior ratings. The update formula is: new_rating = old_rating + K * (actual_score - expected_score), where K is a sensitivity factor and expected_score is computed via the logistic function: expected = 1 / (1 + 10^((opponent_rating - player_rating) / 400)). This produces a stable, transitive ordering of models by inferred strength, even when not all pairs have been directly compared.
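The update rule above can be sketched in a few lines of Python. Function names like expected_score are illustrative, not from any particular library; the ratings and K value are arbitrary examples.

```python
def expected_score(r_player, r_opponent):
    """Win probability for the player, via the logistic curve on a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400))

def update_elo(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after a single match."""
    e_w = expected_score(r_winner, r_loser)   # winner's expected score
    e_l = expected_score(r_loser, r_winner)   # loser's expected score (= 1 - e_w)
    return r_winner + k * (1 - e_w), r_loser + k * (0 - e_l)

# A 1400-rated model beats a 1200-rated one: the favorite was expected to win
# (~76% probability), so it gains only a few points.
new_w, new_l = update_elo(1400, 1200)
```

Note that with the same K on both sides the update is zero-sum: the points the winner gains are exactly the points the loser gives up, so the rating pool's total is conserved.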

Why it matters: Elo provides a continuous, interpretable scalar that correlates with human preference, enabling ranking of dozens of models (e.g., on the Chatbot Arena leaderboard). It is computationally cheaper than full pairwise matrix methods (e.g., Bradley-Terry) when comparisons are sparse, and it naturally handles dynamic updates as new models are added. However, Elo assumes a stationary skill level and ignores context-dependent variability (e.g., a model may excel at coding but fail at creative writing). It also suffers from rating inflation/deflation if the pool of models changes significantly.

When used vs alternatives: Elo is the default for live, continuous leaderboards like LMSYS Chatbot Arena (over 1 million votes as of 2025). Alternatives include: (a) Bradley-Terry models, which are more statistically rigorous for static datasets (e.g., used in the AlpacaEval 2.0 leaderboard); (b) Glicko-2, which adds rating uncertainty and rating volatility (used in online games, less common in LLM evaluation); (c) direct scoring (e.g., 1-5 Likert) which is simpler but prone to scale drift across raters. Elo is preferred when collecting pairwise preferences from humans or LLM judges (e.g., GPT-4 as a judge) because it is intuitive and requires only relative judgments, not absolute scores.
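For contrast with Elo's one-match-at-a-time updates, a Bradley-Terry model fits all strengths jointly from a static wins matrix. A minimal sketch using the classic MM (Zermelo) iteration follows; the wins data and function name are made up for illustration.

```python
def bradley_terry(wins, iters=100):
    """Fit Bradley-Terry strengths from a wins matrix via the MM algorithm.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            # Games played against each opponent, weighted by current strengths.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize for identifiability
    return p

# Toy data: model 0 usually beats 1 and 2; model 1 usually beats 2.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(wins)
```

Unlike Elo, this estimate does not depend on the order in which comparisons arrived, which is why Bradley-Terry is preferred for fixed datasets.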

Common pitfalls: (a) Using too small a K factor (e.g., K=1) makes ratings slow to converge; too large (e.g., K=64) causes volatility. (b) Ignoring the confidence interval: Elo ratings without a measure of uncertainty (e.g., bootstrap confidence intervals) can mislead. (c) Comparing ratings across different time periods or judge pools is invalid because the scale is relative to the population. (d) Overinterpreting small rating differences: a 10-point difference may not be statistically significant with few votes. (e) Using Elo for non-transitive tasks (e.g., rock-paper-scissors scenarios) where A beats B, B beats C, but C beats A — Elo will still produce a ranking, but it will be misleading.

Current state of the art (2026): The dominant implementation is the Elo system used by LMSYS Chatbot Arena, which now includes over 200 models and uses a Bayesian variant with uncertainty quantification. Researchers have proposed extensions: (a) Multi-dimensional Elo that factors in task type (e.g., coding vs. reasoning) by maintaining separate ratings per dimension. (b) Time-decayed Elo that downweights older votes to track model improvements from fine-tuning. (c) Elo with tie handling, where a draw counts as an actual score of 0.5 for both sides. The community standard is to report Elo with 95% bootstrap confidence intervals and to use a K factor of 32 for new models, decaying to 16 after 1000 votes. The Chatbot Arena Elo has been shown to correlate with human preference (Spearman ρ = 0.94) compared to full human pairwise ranking on a held-out set (Zheng et al., 2024).
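Tie handling and the K-decay schedule described above can be combined in one update step. A hedged sketch; the 1000-vote cutoff mirrors the convention stated in this section, and the function name is illustrative:

```python
def update_with_ties(r_a, r_b, score_a, games_a, games_b):
    """One Elo step with draws and a per-model K schedule.

    score_a is 1 if A won, 0.5 for a draw, 0 if A lost.
    K is 32 for models with fewer than 1000 votes, then 16 (a convention,
    not a mathematical requirement)."""
    def k_for(games):
        return 32 if games < 1000 else 16
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    new_a = r_a + k_for(games_a) * (score_a - e_a)
    new_b = r_b + k_for(games_b) * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# A draw between equal-rated models changes nothing...
a, b = update_with_ties(1200, 1200, 0.5, 10, 10)
# ...but a draw against a weaker opponent costs the favorite points.
fa, fb = update_with_ties(1300, 1100, 0.5, 10, 10)
```

Because a draw is scored as 0.5 against an expected score derived from the rating gap, ties are informative: drawing with a much weaker model is evidence the favorite is overrated.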

Examples

  • LMSYS Chatbot Arena uses Elo ratings to rank over 200 LLMs (e.g., GPT-4o at 1350 Elo, Llama 3.1 405B at 1280 Elo as of Jan 2025).
  • AlpacaEval 2.0 uses a Bradley-Terry model (not Elo) to compute win rates against GPT-4, but the underlying pairwise comparison data could be converted to Elo.
  • The Open LLM Leaderboard (Hugging Face) originally used a simple accuracy average, but community forks have added Elo-based ranking for chatbot tasks.
  • DeepMind's AlphaGo used a variant of Elo to track self-play training progress, with Elo ratings exceeding 5000 during training.
  • Chatbot Arena's leaderboard itself is built from human votes, but the companion MT-Bench evaluation from the same LMSYS team uses GPT-4 as a judge for pairwise comparisons, with reported agreement with human judges exceeding 80% on a validation set.

Related terms

Bradley-Terry Model · Glicko-2 · Pairwise Comparison · Human Evaluation · LLM-as-a-Judge


FAQ

What is Elo Rating?

Elo Rating is a pairwise comparison system that estimates relative skill from match outcomes, adapted from chess to evaluate LLMs by having models compete in head-to-head judgments.

How does Elo Rating work?

After each head-to-head comparison, the two models' ratings are updated via new_rating = old_rating + K * (actual_score - expected_score), where the expected score is a logistic function of the rating gap (a 400-point gap implies roughly a 10:1 win expectation). Winners gain points from losers, with upsets moving ratings more than expected results, so repeated comparisons converge to a ranking by inferred strength even when not every pair of models has been compared directly.

Where is Elo Rating used in 2026?

LMSYS Chatbot Arena uses Elo ratings to rank over 200 LLMs (e.g., GPT-4o at 1350 Elo, Llama 3.1 405B at 1280 Elo as of Jan 2025). AlpacaEval 2.0 uses a Bradley-Terry model (not Elo) to compute win rates against GPT-4, but the underlying pairwise comparison data could be converted to Elo. The Open LLM Leaderboard (Hugging Face) originally used a simple accuracy average, but community forks have added Elo-based ranking for chatbot tasks.