LiveCodeBench is a benchmark designed to measure the code generation capabilities of large language models (LLMs) on a continuous stream of new, unpublished programming problems. It was introduced in 2024 to address a critical flaw in static benchmarks like HumanEval and MBPP: data contamination. As models are trained on ever-larger corpora that often include these static benchmarks, their scores become inflated and no longer reflect true generalization.

LiveCodeBench avoids this by sourcing problems from recent competitive programming contests (e.g., Codeforces, AtCoder, LeetCode) that are released after a model's training cutoff date, and it is updated monthly so the problems remain unseen by any model during training. To evaluate a model, LiveCodeBench prompts it to generate a solution given a problem statement and a set of example test cases; the generated code is then executed against a hidden suite of test cases (including edge cases and performance constraints) to compute a pass@k score. Problems span multiple difficulty levels and types (e.g., dynamic programming, graph algorithms, string manipulation).

As of 2026, LiveCodeBench has become the de facto standard for evaluating code LLMs in research and industry. Major labs (OpenAI, Google DeepMind, Meta, Anthropic) use it to report their models' coding performance; for instance, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all report LiveCodeBench scores in their technical reports. The benchmark has also spawned variants such as LiveCodeBench-Hard, which filters for only the hardest problems (e.g., Codeforces rating > 2000).

A common pitfall: LiveCodeBench scores are not directly comparable across months because problem difficulty varies, so researchers typically report scores relative to a baseline model (e.g., GPT-4) evaluated in the same month.
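The pass@k metric mentioned above is conventionally computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass all hidden tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem (assumes k <= n)
    c: number of samples that passed all hidden tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must contain at least one passing solution.
        return 1.0
    # Probability that all k drawn samples fail, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, `pass_at_k(2, 1, 1)` gives 0.5, matching the intuition that a single random draw succeeds half the time. Per-problem estimates are then averaged over the benchmark's problem set.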
A second pitfall is that models may still overfit to the style of competitive programming problems, leading to weaker performance on real-world software engineering tasks (e.g., bug fixing, refactoring).

Current state of the art (2026): the highest reported LiveCodeBench pass@1 score is around 85% on the hardest problems (Codeforces Div. 1), achieved by combining chain-of-thought prompting with test-time compute scaling (e.g., OpenAI's o3 model). Open-source models such as DeepSeek-Coder-V2 and Qwen2.5-Coder now reach roughly 70% on the same subset, closing the gap.
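The execution step described above, running generated code against hidden test cases under a time limit, can be sketched as follows. This is an illustrative harness only, not LiveCodeBench's actual implementation: the function name and the simple subprocess isolation are assumptions, and a production grader would add stronger sandboxing and memory limits.

```python
import subprocess
import sys
import tempfile

def passes_test_case(code: str, stdin_data: str, expected: str,
                     timeout_s: float = 2.0) -> bool:
    """Run candidate code in a subprocess against one hidden test case.

    Returns True iff the program exits cleanly within the time limit
    and its stdout matches the expected output (whitespace-trimmed).
    """
    # Write the candidate solution to a temporary script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # enforce the performance constraint
        )
    except subprocess.TimeoutExpired:
        return False  # time limit exceeded counts as a failure
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```

A solution passes a problem only if it passes every hidden test case; pass@k is then computed over these per-problem pass/fail outcomes.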
LiveCodeBench: definition + examples
Examples
- OpenAI's GPT-4o scored 82% pass@1 on LiveCodeBench (May 2024 edition) vs 67% on HumanEval, demonstrating reduced overfitting.
- Anthropic's Claude 3.5 Sonnet achieved 79% pass@1 on LiveCodeBench (June 2024), with particularly strong performance on dynamic programming problems.
- Google DeepMind's Gemini 1.5 Pro scored 76% on LiveCodeBench (July 2024), showing improvement over Gemini 1.0 Pro (68%).
- Meta's Code Llama 70B scored 58% on LiveCodeBench (August 2024), while the fine-tuned version Code Llama 70B-Instruct reached 64%.
- The open-source model DeepSeek-Coder-V2-Instruct achieved 71% on LiveCodeBench (September 2024), outperforming many proprietary models from 2023.
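Because monthly problem sets vary in difficulty, absolute scores like those above are typically compared against a baseline model evaluated on the same month's problems. A toy sketch of that normalization (the model names and numbers below are hypothetical, and the ratio-to-baseline convention is one possible choice):

```python
# Hypothetical monthly results: {month: {model: pass@1}}
results = {
    "2024-05": {"model_a": 0.62, "baseline": 0.55},
    "2024-06": {"model_a": 0.48, "baseline": 0.40},
}

def relative_scores(results: dict, baseline: str = "baseline") -> dict:
    """Express each model's pass@1 as a ratio to the baseline model
    evaluated on the same month's problem set."""
    return {
        month: {
            model: score / scores[baseline]
            for model, score in scores.items()
            if model != baseline
        }
        for month, scores in results.items()
    }
```

Here `model_a`'s raw score drops from 62% to 48% between months, but its score relative to the baseline actually improves (1.13x to 1.20x), illustrating why cross-month raw scores can mislead.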
Latest news mentioning LiveCodeBench
- InCoder-32B-Thinking Hits 81.3% on LiveCodeBench, Trained on Chip & Kernel Traces (Apr 11, 2026)
  InCoder-32B-Thinking, a 32B-parameter model trained on execution traces from chip design, GPU kernels, and embedded systems, scores 81.3% on LiveCodeBench V5 and an 84% compile pass rate on CAD-Coder.
- DeepSeek V4 Begins Limited Rollout with Fast, Expert, Vision Modes (Apr 7, 2026)
  DeepSeek V4 is reportedly in limited gray-scale testing with a new interface offering Fast, Expert, and Vision modes. This mirrors competitor Kimi's tiered system and suggests a move towards performance…
- Meta-Harness from Stanford/MIT Shows System Code Creates 6x AI Performance Gap (Apr 6, 2026)
  Stanford and MIT researchers show AI performance depends as much on the surrounding system code (the "harness") as the model itself. Their Meta-Harness framework automatically improves this code, yielding…
- Step-3.5-Flash: 196B Open-Source MoE Model Activates Only 11B Parameters, Outperforms Kimi K2.5 and Claude Opus 4.5 on Key Benchmarks (Mar 24, 2026)
  Shanghai-based StepFun's Step-3.5-Flash, a 196B-parameter sparse mixture-of-experts model that activates only 11B parameters per token, achieves top scores on AIME 2025 (97.3) and LiveCodeBench-V6 (86…)
FAQ
What is LiveCodeBench?
LiveCodeBench is a dynamic benchmark for evaluating code generation models on fresh, unpublished programming problems, replacing static datasets like HumanEval to prevent data contamination.
How does LiveCodeBench work?
Problems are drawn monthly from recent competitive programming contests (e.g., Codeforces, AtCoder, LeetCode) released after a model's training cutoff date, so they cannot appear in training data. The model is prompted with a problem statement and example test cases, and its generated code is executed against a hidden suite of tests (including edge cases and performance constraints) to compute a pass@k score.
Where is LiveCodeBench used in 2026?
LiveCodeBench is the de facto standard for evaluating code LLMs in research and industry. Major labs (OpenAI, Google DeepMind, Meta, Anthropic) report LiveCodeBench scores in their models' technical reports, and variants such as LiveCodeBench-Hard (Codeforces rating > 2000) are used to stress-test frontier models on the hardest problems.