Pass@k is a widely used evaluation metric in AI/ML, particularly for generative models in code synthesis, mathematical reasoning, and other problem-solving domains. It estimates the probability that, given k independent samples generated by a model for a single problem, at least one sample is correct. The metric is formally defined as the complement of the probability that all k samples are incorrect: Pass@k = 1 - (1 - p)^k, where p is the per-sample correctness probability. In practice, because the true p is unknown, Pass@k is computed with an unbiased estimator based on the number of correct samples c among the n samples generated per problem; naively plugging the empirical accuracy c/n into the formula above introduces bias.
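The bias of the naive plug-in can be seen with a small simulation. The sketch below is illustrative only; the variable names and the particular choices of p, n, and k are ours, not taken from any benchmark.

```python
import numpy as np

# Illustrative simulation: compare the true Pass@k = 1 - (1 - p)^k against the
# naive plug-in estimate 1 - (1 - c/n)^k, where c ~ Binomial(n, p) is the
# number of correct samples out of n for a single problem.
rng = np.random.default_rng(0)
p, n, k, trials = 0.1, 20, 10, 200_000

true_pass_at_k = 1.0 - (1.0 - p) ** k

c = rng.binomial(n, p, size=trials)   # correct counts per simulated problem
naive = 1.0 - (1.0 - c / n) ** k      # plug-in estimate per problem

print(f"true Pass@{k}:        {true_pass_at_k:.3f}")   # ~0.65
print(f"mean naive estimate:  {naive.mean():.3f}")     # noticeably below the true value
```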
Technically, the unbiased estimator of Pass@k is calculated as: Pass@k = 1 - \binom{n-c}{k} / \binom{n}{k}, where n is the total number of samples generated per problem, c is the number of correct samples, and k is the number of considered samples. The ratio \binom{n-c}{k} / \binom{n}{k} is the exact probability that a uniformly drawn subset of k of the n samples contains no correct sample, and its expectation equals the true probability that k independent samples all fail, whereas the naive plug-in 1 - (1 - c/n)^k is biased for k > 1. For example, if n=100 samples are generated per problem and c=10 are correct, the unbiased Pass@1 is 1 - \binom{90}{1} / \binom{100}{1} = 0.1; for k=1 the estimator simply reduces to c/n and coincides with the naive value, but for k > 1 the bias correction becomes nontrivial.
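A minimal implementation of this estimator (a Python/NumPy sketch following the numerically stable product form used in the Codex paper's reference code; the function name is ours) looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for the problem
    c: number of those samples that are correct
    k: number of samples considered (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # must contain at least one correct sample.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k); avoids huge binomial coefficients.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Worked example from the text: n=100, c=10.
print(pass_at_k(100, 10, 1))    # 0.1  (reduces to c/n when k=1)
print(pass_at_k(100, 10, 10))   # ~0.67, above the naive 1 - 0.9**10 ≈ 0.65
```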
Pass@k originated in the code generation literature, notably in OpenAI's Codex paper (Chen et al., 2021), which evaluated code synthesis on the HumanEval benchmark. It has since become standard for evaluating large language models (LLMs) on coding tasks (e.g., HumanEval, MBPP, APPS) and mathematical reasoning (e.g., GSM8K, MATH). The metric addresses a key limitation of traditional accuracy metrics: in generative tasks, a model may produce a correct solution only rarely among many attempts, and Pass@k captures the model's ability to generate a correct answer when given multiple chances. This reflects practical use cases where a developer can inspect several generated code snippets and pick the correct one.
Compared to alternatives such as top-1 accuracy or exact match, Pass@k provides a more nuanced view of model performance, especially for stochastic decoding (e.g., temperature sampling). It is often reported at multiple k values, such as Pass@1, Pass@10, and Pass@100. In 2026, code-generation evaluations typically report Pass@1 as the primary metric for deterministic decoding (temperature=0), while Pass@100 is used to assess how well a model covers the space of possible solutions. However, Pass@k has pitfalls: it does not measure solution quality beyond correctness (e.g., efficiency, readability), and because Pass@k increases monotonically with k, headline numbers can be inflated simply by generating and reporting more samples. To mitigate this, researchers often pair Pass@k with a fixed compute budget (e.g., total tokens generated) or report it alongside other signals such as stricter test suites, test case coverage, or human preference ratings.
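To illustrate reporting at several k values, the sketch below (an assumed workflow, not a specific library's API) averages the per-problem estimator from the earlier snippet over a small hypothetical benchmark; the function name, counts, and sample sizes are ours.

```python
import numpy as np

# Sketch: report Pass@k at several k values from one batch of n samples per
# problem, reusing the pass_at_k() helper sketched earlier. correct_counts[i]
# is the number of samples (out of n) passing the unit tests for problem i.
def benchmark_pass_at_k(correct_counts, n, ks=(1, 10, 100)):
    return {
        k: float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))
        for k in ks
        if k <= n  # the estimator needs at least k samples per problem
    }

# Hypothetical correct counts for a 5-problem benchmark with n=200 samples each.
scores = benchmark_pass_at_k([0, 3, 17, 60, 200], n=200)
print(scores)  # Pass@k rises with k but is capped by the fraction of problems ever solved
```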
Current best practices (2026) involve reporting Pass@k with k values that match the deployment scenario (e.g., k=1 for autocomplete, k=10 for code review assistants). The metric also extends to multi-turn and agentic settings, where Pass@k measures the fraction of tasks for which at least one of k sampled trajectories succeeds. Reported scores depend heavily on prompting and decoding setup; commonly cited HumanEval Pass@1 figures include Claude 3.5 Sonnet at 92% and DeepSeek-Coder-V2 at 90.2%, with GPT-4-class models in a similar range, while on MATH, GPT-4o reports a Pass@1 of roughly 76%. The metric remains central to leaderboards such as the Big Code Models Leaderboard and the EvalPlus HumanEval+ benchmark.