Evaluation

LiveCodeBench: definition + examples

LiveCodeBench is a benchmark designed to measure the code generation capabilities of large language models (LLMs) on a continuous stream of new, unpublished programming problems. It was introduced in 2024 to address a critical flaw in static benchmarks like HumanEval and MBPP: data contamination. As models are trained on ever-larger corpora that often include these static benchmarks, their scores become inflated and no longer reflect true generalization. LiveCodeBench avoids this by sourcing problems from recent competitive programming contests (e.g., Codeforces, AtCoder, LeetCode) released after a model's training cutoff date, and the problem pool is refreshed monthly so that problems postdate the training data of the models being evaluated.

Technically, LiveCodeBench prompts a model to generate a solution given a problem statement and a set of example test cases. The generated code is then executed against a hidden suite of test cases, including edge cases and performance constraints, to compute a pass@k score. The benchmark covers multiple difficulty levels and problem types (e.g., dynamic programming, graph algorithms, string manipulation).

As of 2026, LiveCodeBench has become the de facto standard for evaluating code LLMs in research and industry. Major labs (OpenAI, Google DeepMind, Meta, Anthropic) use it to report their models' coding performance; for instance, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all report LiveCodeBench scores in their technical reports. The benchmark has also spawned variants such as LiveCodeBench-Hard, which keeps only the hardest problems (e.g., Codeforces rating > 2000).

Two pitfalls are worth noting. First, LiveCodeBench scores are not directly comparable across months because problem difficulty varies; researchers typically report scores relative to a baseline model (e.g., GPT-4) evaluated on the same month's problems. Second, models may still overfit to the style of competitive programming problems, leading to lower performance on real-world software engineering tasks such as bug fixing and refactoring.

Current state of the art (2026): the highest reported pass@1 on the hardest problems (Codeforces Div. 1) is around 85%, achieved with a combination of chain-of-thought prompting and test-time compute scaling (e.g., OpenAI's o3 model). Open-source models like DeepSeek-Coder-V2 and Qwen2.5-Coder now reach roughly 70% on the same subset, closing the gap.
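To make the evaluation loop concrete, the sketch below shows how a harness in this style might execute a model-generated solution against hidden stdin/stdout test cases with a per-test time limit. This is a minimal illustration, not the actual LiveCodeBench harness; the HIDDEN_TESTS data, the stdin/stdout format, and the reliance on a `python` executable on PATH are assumptions made for the example.

```python
import subprocess
import tempfile
import textwrap

# Hypothetical hidden test cases for one problem: (stdin, expected stdout).
HIDDEN_TESTS = [("3\n1 2 3\n", "6\n"), ("1\n42\n", "42\n")]

def run_candidate(code: str, tests, timeout_s: float = 5.0) -> bool:
    """Execute a model-generated solution against hidden tests.

    Returns True only if every test produces the expected output within
    the time limit (a rough stand-in for performance constraints).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(code))
        path = f.name
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                ["python", path], input=stdin, capture_output=True,
                text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # time limit exceeded
        if result.returncode != 0 or result.stdout != expected:
            return False  # crash or wrong answer
    return True

# Example: a candidate solution that reads n and prints the sum of n numbers.
candidate = """
n = int(input())
print(sum(map(int, input().split())))
"""
print(run_candidate(candidate, HIDDEN_TESTS))  # True if all hidden tests pass
```

Aggregating these per-problem verdicts over the month's problem set gives pass@1; sampling several generations per problem gives the data needed for pass@k.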

Examples

  • OpenAI's GPT-4o scored 82% pass@1 on LiveCodeBench (May 2024 edition) vs 67% on HumanEval, demonstrating reduced overfitting.
  • Anthropic's Claude 3.5 Sonnet achieved 79% pass@1 on LiveCodeBench (June 2024), with particularly strong performance on dynamic programming problems.
  • Google DeepMind's Gemini 1.5 Pro scored 76% on LiveCodeBench (July 2024), showing improvement over Gemini 1.0 Pro (68%).
  • Meta's Code Llama 70B scored 58% on LiveCodeBench (August 2024), while the fine-tuned version Code Llama 70B-Instruct reached 64%.
  • The open-source model DeepSeek-Coder-V2-Instruct achieved 71% on LiveCodeBench (September 2024), outperforming many proprietary models from 2023.

Related terms

HumanEval, MBPP, SWE-bench, CodeContests, pass@k

FAQ

What is LiveCodeBench?

LiveCodeBench is a dynamic benchmark for evaluating code generation models on fresh, unpublished programming problems, replacing static datasets like HumanEval to prevent data contamination.

How does LiveCodeBench work?

Models are prompted with a problem statement and a set of example test cases drawn from recent competitive programming contests (e.g., Codeforces, AtCoder, LeetCode). The generated solution is then executed against a hidden suite of test cases, including edge cases and performance constraints, and results are aggregated into a pass@k score. Because the problem pool is refreshed monthly with contests held after model training cutoffs, scores reflect generalization rather than memorization of published benchmarks.
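The pass@k statistic mentioned above is typically computed with the unbiased estimator popularized by the HumanEval paper: sample n generations per problem, count the c that pass all hidden tests, and estimate the probability that at least one of k samples would pass. A minimal Python sketch, using illustrative numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total generations sampled, c: how many passed all hidden tests,
    k: sample budget. Returns the estimated probability that at least
    one of k samples is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k failures, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 correct -> pass@1 = 0.3, pass@5 ≈ 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

The benchmark-level score is the mean of this quantity over all problems in the evaluated month.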

Where is LiveCodeBench used in 2026?

LiveCodeBench is the de facto standard for evaluating code LLMs in research and industry. Major labs including OpenAI, Google DeepMind, Meta, and Anthropic report LiveCodeBench scores in their technical reports (e.g., for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro), and open-source models such as DeepSeek-Coder-V2 and Qwen2.5-Coder are benchmarked against it as well. Variants like LiveCodeBench-Hard are used to track progress on the hardest problems.