HumanEval: definition + examples

HumanEval is a widely adopted benchmark for evaluating the functional correctness of code generated by large language models (LLMs). Introduced by OpenAI in the 2021 paper "Evaluating Large Language Models Trained on Code" (Codex paper), it consists of 164 original programming problems, each with a function signature, a docstring describing the task, and several unit tests. The benchmark is designed to test a model's ability to generate code that is not just syntactically valid but semantically correct—i.e., it passes all provided tests.
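
For concreteness, here is an illustrative problem record following the schema of the released HumanEval.jsonl file (task_id, prompt, entry_point, canonical_solution, test). The problem shown is a paraphrased stand-in, not a verbatim dataset entry:

# Illustrative HumanEval-style record. The field names match the released
# HumanEval.jsonl schema; the problem content here is a made-up stand-in.
problem = {
    "task_id": "HumanEval/illustrative",
    "prompt": (
        'def is_palindrome(text: str) -> bool:\n'
        '    """Return True if text reads the same forwards and backwards.\n'
        '    >>> is_palindrome("level")\n'
        '    True\n'
        '    >>> is_palindrome("hello")\n'
        '    False\n'
        '    """\n'
    ),
    "entry_point": "is_palindrome",
    "canonical_solution": "    return text == text[::-1]\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate('level') is True\n"
        "    assert candidate('hello') is False\n"
        "    assert candidate('') is True\n"
    ),
}

# The model sees only `prompt` and must generate the function body; the
# harness then concatenates prompt + completion and runs check() on it.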

Technically, HumanEval works by prompting a model with the function signature and docstring, then sampling a completion. The generated code is executed against the unit tests in a sandboxed environment to ensure safety and reproducibility. The primary metric is "pass@k": the probability that at least one of k generated samples passes all tests. The most common variant is pass@1 (single generation), but pass@10 and pass@100 are also reported to assess sampling-based coverage. The benchmark is deliberately small (164 problems), and every problem is hand-written rather than scraped from public code, which reduces the risk of training-data contamination and allows manual verification of correctness. Each problem is crafted to test specific programming constructs, including recursion, string manipulation, arithmetic, and data structure operations.
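
In practice, the reported pass@k is computed with the unbiased estimator from the Codex paper rather than by literally drawing k samples: generate n ≥ k completions per problem, count the c correct ones, and estimate the probability that a random size-k subset contains at least one correct sample. A minimal NumPy version:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for the problem
    c: number of samples that passed all unit tests
    k: the k in pass@k
    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable way.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 13 of which pass the tests.
print(pass_at_k(200, 13, 1))    # 0.065 (the fraction correct)
print(pass_at_k(200, 13, 100))  # ~0.9999

The benchmark score is this quantity averaged over all 164 problems.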

HumanEval matters because it introduced a rigorous, functionally grounded evaluation paradigm that goes beyond surface-level metrics like BLEU or accuracy on multiple-choice coding questions. It directly measures whether an LLM can generate executable code that solves a given problem, which is critical for real-world programming assistants. The benchmark has become a standard for comparing code generation models, including Codex, GPT-4, CodeLlama, StarCoder, DeepSeek-Coder, and Gemini.

When to use HumanEval vs. alternatives: HumanEval is ideal for evaluating basic functional correctness on Python problems. However, it has limitations: it tests only Python, covers a narrow range of problem difficulty, and does not assess code efficiency, readability, or security. For broader evaluation, researchers use HumanEval+ (the same problems with greatly expanded test suites), MBPP (Mostly Basic Programming Problems, 974 problems), APPS (harder, competitive-programming-style problems), or SWE-bench (real-world GitHub issues). For multi-language evaluation, MultiPL-E translates HumanEval into 18+ programming languages and HumanEval-X provides versions in five languages, while HumanEvalPack extends the format beyond synthesis to code repair and explanation tasks.

Common pitfalls when using HumanEval include: (1) Overfitting to the benchmark by training on leaked solutions; (2) Relying solely on pass@k without considering sample diversity or variance; (3) Ignoring test coverage—the original unit tests may be insufficient (HumanEval+ addresses this by adding more tests); (4) Not controlling for sampling temperature, which significantly affects pass rates; (5) Treating HumanEval as a proxy for all code generation ability, when it only measures functional correctness on simple tasks.
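
As a sketch of pitfalls (2) and (4), the snippet below follows the usual protocol: generate n samples per problem at an explicitly stated temperature, check each against the unit tests, and feed the correct count into the pass@k estimator defined above. Here `generate_completion` is a hypothetical stand-in for whatever model API is under evaluation, and the exec()-based checker is for illustration only; a production harness such as OpenAI's human-eval repository runs each candidate in an isolated process with a timeout.

import numpy as np

def passes_tests(candidate_source: str, problem: dict) -> bool:
    """Run the problem's check() against a candidate program.
    WARNING: exec() on untrusted model output is unsafe; shown only for
    illustration. Real harnesses sandbox this in a separate process."""
    env: dict = {}
    try:
        exec(candidate_source, env)                # defines the function
        exec(problem["test"], env)                 # defines check()
        env["check"](env[problem["entry_point"]])  # raises on any failure
        return True
    except Exception:
        return False

def humaneval_score(problems, generate_completion, n=200, k=1, temperature=0.8):
    """Mean pass@k over problems. `generate_completion(prompt, temperature)`
    is a hypothetical stand-in for the model being evaluated."""
    per_problem = []
    for problem in problems:
        samples = [generate_completion(problem["prompt"], temperature)
                   for _ in range(n)]
        c = sum(passes_tests(problem["prompt"] + s, problem) for s in samples)
        per_problem.append(pass_at_k(n, c, k))  # estimator defined above
    return float(np.mean(per_problem))

Because temperature shifts the diversity/precision trade-off, pass@1 is typically reported from low-temperature or greedy samples and pass@100 from higher temperatures; scores are only comparable at matched sampling settings.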

Current state of the art (2026): Top models achieve pass@1 above 95% on HumanEval, saturating the benchmark. As a result, the community has shifted to harder benchmarks like HumanEval+ (which adds edge-case tests) and SWE-bench (real-world software engineering tasks). Frontier models like GPT-5, Claude 4, Gemini 2, and DeepSeek-V3 all report near-perfect scores. HumanEval remains a standard sanity check but is no longer considered a discriminating benchmark for advanced coding ability. Newer evaluations focus on multi-step reasoning, tool use, and repository-level code understanding.

Examples

  • OpenAI Codex scored 28.8% pass@1 on HumanEval in 2021, demonstrating the viability of LLMs for code generation.
  • GPT-4 achieved 67.0% pass@1 on HumanEval in March 2023, a major improvement over prior models.
  • CodeLlama-34B from Meta scored 48.8% pass@1 on HumanEval in August 2023, showing strong open-source performance.
  • DeepSeek-Coder-V2 reached 90.2% pass@1 on HumanEval in June 2024, near saturation.
  • HumanEval+ (introduced with the EvalPlus framework in 2023) scales each problem's test suite by roughly 80×, revealing pass@k drops of up to roughly 20-29% in some models due to insufficient edge-case handling.

Related terms

MBPP · SWE-bench · pass@k · Codex · MultiPL-E

FAQ

What is HumanEval?

HumanEval is an evaluation benchmark consisting of 164 hand-written Python programming problems, each with unit tests, used to measure the functional correctness of code generated by large language models.

How does HumanEval work?

A model is prompted with a problem's function signature and docstring and generates a completion. The completed program is then executed against the problem's unit tests in a sandboxed environment, and performance is reported as pass@k: the probability that at least one of k generated samples passes all tests.

Where is HumanEval used in 2026?

By 2026, HumanEval is largely saturated: frontier models report pass@1 above 95%, so it serves mainly as a sanity check in model release reports. Discriminating evaluation of coding ability has shifted to harder benchmarks such as HumanEval+ and SWE-bench, which stress edge-case handling and real-world software engineering tasks.