gentic.news — AI News Intelligence Platform
Evaluation

Pass@k: definition + examples

Pass@k is a widely used evaluation metric in AI/ML, particularly for generative models in code synthesis, mathematical reasoning, and other problem-solving domains. It estimates the probability that, given k independent samples generated by a model for a single problem, at least one sample is correct. Formally, it is the complement of the probability that all k samples are incorrect: Pass@k = 1 - (1 - p)^k, where p is the per-sample correctness probability. In practice the true p is unknown, so Pass@k is computed per problem with an unbiased estimator over n ≥ k generated samples and then averaged across the problem set; naively plugging the empirical accuracy in for p would bias the estimate.

Technically, the unbiased estimator of Pass@k is calculated as: Pass@k = 1 - (n-c choose k) / (n choose k), where n is the total number of samples generated per problem, c is the number of correct samples, and k ≤ n is the number of samples considered. The term (n-c choose k) / (n choose k) is the probability that a random draw of k of the n generated samples contains no correct one; the naive plug-in 1 - (1 - c/n)^k overestimates Pass@k for k > 1. For example, if n=100 samples are generated per problem and c=10 are correct, the unbiased Pass@1 is 1 - (90 choose 1)/(100 choose 1) = 0.1 (for k=1 the estimator reduces to c/n), but for k>1 the bias correction becomes nontrivial.
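The estimator above can be computed without forming large binomial coefficients by expanding the ratio as a telescoping product, the numerically stable form used in the Codex paper (Chen et al., 2021). The function name and the illustrative numbers below are our own:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of Pass@k: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of correct samples among them
    k: number of samples considered (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset
        # must contain at least one correct sample.
        return 1.0
    # C(n-c, k) / C(n, k) expanded as a product of (1 - k/i) terms,
    # which avoids overflow for large n.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n=100 samples and c=10 correct, Pass@1 reduces to c/n = 0.1,
# while Pass@10 is substantially higher (~0.67).
print(round(pass_at_k(100, 10, 1), 4))   # 0.1
print(round(pass_at_k(100, 10, 10), 4))
```

Per-problem values from this function are then averaged over the benchmark's problem set to report a single Pass@k score.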

Pass@k originated in the code generation literature, notably with OpenAI's Codex paper (Chen et al., 2021) evaluating code synthesis on the HumanEval benchmark. It has since become standard for evaluating large language models (LLMs) on coding tasks (e.g., on HumanEval, MBPP, APPS) and mathematical reasoning (e.g., GSM8K, MATH). The metric addresses a key limitation of traditional accuracy metrics: in generative tasks, a model may produce a correct solution only rarely among many attempts, and Pass@k captures the model's ability to generate a correct answer when given multiple chances. This reflects practical use cases where a developer can inspect several generated code snippets and pick the correct one.

Compared to alternatives such as top-1 accuracy or exact match, Pass@k provides a more nuanced view of model performance, especially for models that are stochastic (e.g., with temperature sampling). It is often reported at multiple k values, such as Pass@1, Pass@10, and Pass@100. In 2026, the state-of-the-art in code generation uses Pass@1 as the primary metric for deterministic evaluations (temperature=0), while Pass@100 is used to assess the model's coverage of possible solutions. However, Pass@k has pitfalls: it does not measure solution quality beyond correctness (e.g., efficiency, readability), and it can be gamed by generating many samples (increasing k arbitrarily inflates Pass@k). To mitigate this, researchers often pair Pass@k with a fixed compute budget (e.g., total tokens generated) or use it alongside other metrics like functional correctness, test case coverage, or human preference ratings.
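The inflation effect is easy to see from the closed form Pass@k = 1 - (1 - p)^k: even a low per-sample correctness probability yields a near-certain hit as k grows. The value p = 0.1 below is purely illustrative:

```python
# For a model whose samples are independently correct with probability p,
# Pass@k = 1 - (1 - p)^k rises quickly with k even when p is small.
p = 0.1
for k in (1, 10, 100):
    print(f"Pass@{k} = {1 - (1 - p) ** k:.4f}")
```

At p = 0.1, Pass@10 already exceeds 0.65, which is why Pass@k comparisons are only meaningful at a fixed k and sampling budget.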

Current best practices (2026) involve reporting Pass@k with k values that match the deployment scenario (e.g., k=1 for autocomplete, k=10 for code review assistants). The metric is also extended to multi-turn interactions and agentic settings, where Pass@k measures the fraction of trajectories that include at least one correct action sequence. Notable models evaluated with Pass@k include GPT-4 (Pass@1=87% on HumanEval), Claude 3.5 Sonnet (92%), and DeepSeek-Coder-V2 (90.2%). In math, models like Gemini Ultra achieve Pass@1=82% on MATH, while GPT-4o scores 76%. The metric remains central to leaderboards like the Big Code Models Leaderboard and the EvalPlus HumanEval+ benchmark.

Examples

  • GPT-4 achieves 87% Pass@1 on HumanEval (code generation benchmark with 164 problems).
  • Claude 3.5 Sonnet reports 92% Pass@1 on HumanEval, using temperature=0.2 and 100 samples per problem.
  • DeepSeek-Coder-V2 scores 90.2% Pass@1 on HumanEval and 76.2% on MBPP, with k=1 and n=200 samples.
  • On the MATH benchmark (500 problems), Gemini Ultra achieves 82% Pass@1, while GPT-4o scores 76.6%.
  • AlphaCode (DeepMind) used Pass@10k as a primary metric, solving 34% of Codeforces problems with k=10,000 samples.

Related terms

HumanEval · MBPP · Top-1 Accuracy · Functional Correctness · Temperature Sampling


FAQ

What is Pass@k?

Pass@k measures how often at least one of k generated samples from an AI model contains a correct answer, used primarily for code generation and math reasoning tasks.

How does Pass@k work?

Pass@k estimates the probability that at least one of k independent samples from a model is correct: Pass@k = 1 - (1 - p)^k, where p is the per-sample correctness probability. In practice it is computed with the unbiased estimator 1 - (n-c choose k) / (n choose k), where n samples are generated per problem and c of them are correct, averaged across the problem set.

Where is Pass@k used in 2026?

Pass@k is the standard metric on coding benchmarks such as HumanEval, MBPP, and APPS and on math benchmarks such as GSM8K and MATH, and it anchors leaderboards like the Big Code Models Leaderboard and EvalPlus HumanEval+. Best practice is to report k values matching the deployment scenario (k=1 for autocomplete, k=10 for code review assistants), and the metric has been extended to multi-turn and agentic settings.