gentic.news — AI News Intelligence Platform
Evaluation

GPQA: definition + examples

GPQA (Graduate-Level Physics Question Answering) is a benchmark dataset designed to assess the ability of large language models (LLMs) to perform graduate-level physics reasoning. It was introduced in 2023 by a team of researchers from the University of California, Berkeley, and other institutions in response to the saturation of simpler benchmarks such as MMLU and GSM8K. The dataset consists of 448 multiple-choice questions, each with four answer options, created and validated by physics PhDs and postdoctoral researchers. The questions cover four core subfields: classical mechanics, electromagnetism, quantum mechanics, and thermodynamics, with a balanced distribution across topics.
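Structurally, each item can be thought of as a record with a question stem, four options, a gold answer, and an expert solution. The snippet below is a minimal sketch of loading and inspecting such a dataset with the Hugging Face datasets library; the repository id ('gpqa/main', as cited in the examples further down) and the field names are assumptions for illustration, not a documented schema.

```python
# Minimal sketch: load and inspect a GPQA-style multiple-choice dataset.
# Assumptions: the repo id 'gpqa/main' (as cited later in this article) and
# the field names below are illustrative, not guaranteed by the benchmark.
from datasets import load_dataset

ds = load_dataset("gpqa/main", split="train")  # hypothetical repo id / split
print(f"{len(ds)} questions")                  # expected: 448 per the article

example = ds[0]
print(example["question"])                     # question stem (assumed field)
for letter, option in zip("ABCD", example["options"]):
    print(f"  ({letter}) {option}")            # four answer options (assumed)
print("gold answer:", example["answer"])       # e.g. 'C' (assumed field)
```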

Technically, GPQA questions are designed to require deep conceptual understanding and multi-step reasoning, often involving mathematical derivations, physical intuitions, and the application of advanced principles such as Lagrangian mechanics or perturbation theory. Unlike factoid benchmarks, GPQA emphasizes reasoning chains that mimic real graduate-level problem-solving. Each question is accompanied by a detailed solution written by experts, enabling both accuracy evaluation and post-hoc analysis of model reasoning. The benchmark is publicly available on Hugging Face and is typically used with a zero-shot or few-shot prompt format; the recommended evaluation metric is accuracy on the multiple-choice task.
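To make the zero-shot multiple-choice setup concrete, the sketch below formats a record into a prompt, parses the model's answer letter, and reports plain accuracy. The ask_model callable and the record fields are placeholders; the benchmark itself does not prescribe a particular client or prompt wording.

```python
# Sketch of zero-shot multiple-choice evaluation with accuracy as the metric.
# 'ask_model' is a placeholder for any LLM client; record fields are assumed.
import re

def build_prompt(record: dict) -> str:
    options = "\n".join(
        f"({letter}) {text}" for letter, text in zip("ABCD", record["options"])
    )
    return (
        "Answer the following graduate-level physics question.\n\n"
        f"{record['question']}\n\n{options}\n\n"
        "Respond with a single letter: A, B, C, or D."
    )

def extract_choice(completion: str) -> str | None:
    # Look for a standalone answer letter in the completion.
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(records: list[dict], ask_model) -> float:
    correct = sum(
        extract_choice(ask_model(build_prompt(r))) == r["answer"] for r in records
    )
    return correct / len(records)
```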

Why GPQA matters: It addresses a critical gap in AI evaluation—measuring not just knowledge retrieval but genuine scientific reasoning. As LLMs approach or surpass human-level performance on undergraduate benchmarks (e.g., MMLU scores above 90% for GPT-4 and Gemini), GPQA provides a harder test that remains challenging for state-of-the-art models. As of 2026, the best-performing models (e.g., GPT-5, Gemini Ultra 2, Claude 4) achieve roughly 65-70% accuracy on GPQA, compared to expert human performance of ~90%, indicating significant room for improvement. This makes GPQA a key tool for tracking progress in scientific reasoning and for diagnosing model weaknesses in physics.

When to use GPQA vs alternatives: Use GPQA when evaluating models for scientific research, physics tutoring, or advanced reasoning capabilities. For broader knowledge assessment, MMLU (Massive Multitask Language Understanding) is more appropriate; for math reasoning, GSM8K or MATH are standard. GPQA is complementary to benchmarks like BIG-bench or ARC (AI2 Reasoning Challenge) but focuses exclusively on graduate-level physics. A common pitfall is treating GPQA as a simple multiple-choice test: models may guess correctly without genuine understanding, so researchers often conduct error analysis against the expert solutions to distinguish real reasoning from pattern matching. Another pitfall is overfitting to the public dataset, since repeated exposure may inflate scores; to mitigate this, the community sometimes uses held-out subsets or adversarially filtered versions.
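One cheap guard against the guessing pitfall is to check whether an observed score is statistically distinguishable from the 25% random-guess baseline implied by four-option questions. The sketch below runs an exact binomial test; the score used is a hypothetical value for illustration, and passing the test says nothing about whether the underlying reasoning chains are sound.

```python
# Sketch: is an observed GPQA score distinguishable from 25% random guessing?
# Uses an exact binomial test against chance on four-option questions.
from scipy.stats import binomtest

n_questions = 448   # full GPQA set size per the article
n_correct = 174     # hypothetical model score (~38.8%), for illustration only

result = binomtest(n_correct, n_questions, p=0.25, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.1%}, p = {result.pvalue:.2e}")
# A tiny p-value only rules out uniform guessing; it does not certify that
# the model's reasoning is sound (hence the error analysis described above).
```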

Current state of the art (2026): GPQA remains a frontier benchmark. The highest reported accuracy on the full set is ~72%, achieved by a specialized physics-reasoning system (an ensemble of GPT-5 and a physics-specific fine-tuned model). No model has surpassed 75% as of mid-2026. The benchmark has inspired derivatives like GPQA-Chinese and GPQA-Enhanced, which add more questions and distractor options. Research continues on using GPQA to probe chain-of-thought faithfulness, with findings that even when models answer correctly, their reasoning often contains subtle errors. GPQA is also used as a training signal in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) to improve scientific reasoning in LLMs.
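How GPQA material might feed a preference-optimization pipeline can be illustrated with a simple pairing rule: treat the expert-written solution as the preferred response and a model completion that reached the wrong option as the rejected one. This is a sketch under that assumption, not a published recipe, and the field names are hypothetical.

```python
# Sketch: turning GPQA-style records into DPO preference pairs.
# Assumption: the expert solution is 'chosen' and a model completion that
# picked the wrong option is 'rejected'; field names are illustrative.
def to_preference_pairs(records, model_completions):
    pairs = []
    for record, completion in zip(records, model_completions):
        if completion["choice"] == record["answer"]:
            continue                      # only mistakes yield a useful pair
        pairs.append({
            "prompt": record["question"],
            "chosen": record["solution"],     # expert-written worked solution
            "rejected": completion["text"],   # flawed model reasoning chain
        })
    return pairs
```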

Examples

  • GPT-4 scored 34.8% on GPQA in zero-shot evaluation (2023), significantly below expert human performance (~90%).
  • Gemini Ultra 2 (2025) achieved 68% accuracy on GPQA, the highest among general-purpose models at the time.
  • Claude 4 (2026) reached 71% accuracy using a multi-step chain-of-thought prompt with self-consistency (see the self-consistency sketch after this list).
  • A 2024 study used GPQA to compare reasoning quality across GPT-4, Claude 3, and Llama 3, finding that all models struggled with quantum mechanics questions requiring Dirac notation.
  • The GPQA dataset is hosted on Hugging Face as 'gpqa/main' and has been downloaded over 50,000 times as of 2026.
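The self-consistency setup mentioned in the Claude 4 example above can be approximated by sampling several chain-of-thought completions at nonzero temperature and taking a majority vote over the extracted answer letters. The sample_completion callable below is a placeholder for any such client; this is a generic sketch, not the evaluation protocol used in that report.

```python
# Sketch: self-consistency via majority vote over sampled chain-of-thought runs.
# 'sample_completion' stands in for any client that samples with temperature > 0.
from collections import Counter
import re

def majority_vote_answer(prompt: str, sample_completion, n_samples: int = 8) -> str | None:
    votes = []
    for _ in range(n_samples):
        completion = sample_completion(prompt)          # one sampled CoT run
        match = re.search(r"\b([ABCD])\b", completion)  # extract an answer letter
        if match:
            votes.append(match.group(1))
    # Return the most frequent answer letter, or None if no run produced one.
    return Counter(votes).most_common(1)[0][0] if votes else None
```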

Related terms

MMLU · GSM8K · BIG-bench · Chain-of-Thought · AI2 Reasoning Challenge


FAQ

What is GPQA?

GPQA (Graduate-Level Physics Question Answering) is a benchmark for evaluating AI/ML models on advanced physics reasoning, featuring 448 expert-crafted multiple-choice questions spanning classical mechanics, electromagnetism, quantum mechanics, and thermodynamics.

How does GPQA work?

Each GPQA item presents a graduate-level physics question with four answer options. A model is prompted in a zero-shot or few-shot format, its selected option is compared against the expert-validated answer key, and overall multiple-choice accuracy is reported. Because every question ships with a detailed expert-written solution, researchers can also analyze the model's reasoning chain post hoc rather than only its final answer.

Where is GPQA used in 2026?

As of 2026, GPQA is used to benchmark frontier general-purpose models such as GPT-5, Gemini Ultra 2, and Claude 4 (which score roughly in the 65-72% range, per the examples above), to probe chain-of-thought faithfulness, and as a training signal in RLHF and DPO pipelines aimed at improving scientific reasoning. It also remains a diagnostic tool for physics-specific weaknesses, such as questions requiring Dirac notation.