
MMLU-Pro: definition + examples

MMLU-Pro (Massive Multitask Language Understanding – Professional) is a benchmark introduced in 2024 by researchers at the University of Waterloo's TIGER-Lab and collaborating institutions to address saturation in the original MMLU benchmark. As large language models (LLMs) approached or exceeded 90% accuracy on MMLU, the benchmark lost discriminative power, especially among top-performing models. MMLU-Pro extends MMLU in three ways:

  • It increases the number of answer choices per question from 4 to 10, cutting the random-guess baseline from 25% to 10%.
  • It replaces trivial or ambiguous questions with more complex, expert-level items drawn from professional exams (e.g., law, medicine, engineering).
  • It filters out questions that could be solved by simple pattern matching or memorization.

The dataset contains about 12,000 questions across 14 subject areas (rather than the original MMLU's 57 subjects), with a focus on reasoning, multi-step problem solving, and domain-specific knowledge. Evaluation uses strict exact-match accuracy: the model must commit to the correct answer letter (A–J), typically extracted from a chain-of-thought response, and no partial credit is given.

MMLU-Pro has become a de facto standard for assessing frontier LLMs in 2025–2026. For example, GPT-4o achieves ~72% on MMLU-Pro vs. ~87% on MMLU, while Claude 3.5 Sonnet scores ~69% vs. ~88% on MMLU, demonstrating the increased difficulty. The benchmark is particularly useful for differentiating models that appear equally capable on MMLU, such as Gemini 2.0 Pro vs. Llama 4 405B.

Common pitfalls include overfitting to the exact question format (e.g., models that memorize answer patterns), neglecting calibration (MMLU-Pro does not measure confidence), and assuming that a high score implies general intelligence: MMLU-Pro remains a multiple-choice test and does not evaluate open-ended generation, reasoning chains, or safety. As of 2026, MMLU-Pro is widely used alongside newer benchmarks such as GPQA (graduate-level Q&A) and SWE-bench (code), and it remains a key filter in model release announcements (e.g., DeepSeek-R1, Qwen3). Leading scores hover around 75–78% (e.g., GPT-5, Gemini 2.5 Ultra), leaving significant room for improvement.
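
To make the scoring protocol concrete, here is a minimal sketch in Python. It is not the official evaluation harness: the "answer is (X)" extraction pattern is a simplified version of the patterns such harnesses use, and the fallback to the last standalone letter is our own assumption for illustration.

    # Minimal sketch of MMLU-Pro-style exact-match scoring (illustrative,
    # not the official harness).
    import re

    def extract_choice(response: str) -> str | None:
        """Pull the predicted letter (A-J) from a chain-of-thought response."""
        # Prefer an explicit "answer is (X)" statement.
        m = re.search(r"answer is \(?([A-J])\)?", response)
        if m:
            return m.group(1)
        # Assumed fallback: take the last standalone letter in range.
        letters = re.findall(r"\b([A-J])\b", response)
        return letters[-1] if letters else None

    def exact_match_accuracy(gold: list[str], responses: list[str]) -> float:
        """Strict exact match: extracted letter must equal the gold letter."""
        hits = sum(extract_choice(r) == g for g, r in zip(gold, responses))
        return hits / len(gold)

    # Toy usage with two hypothetical items (gold answers C and F):
    gold = ["C", "F"]
    responses = [
        "Balancing the torque equation gives 4.2 N·m, so the answer is (C).",
        "The statute of frauds applies here, so I choose B.",
    ]
    print(exact_match_accuracy(gold, responses))  # 0.5

Note that strict extraction matters in practice: a model that reasons correctly but never states a clean final letter is scored as wrong.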

Examples

  • GPT-4o scored 72.3% on MMLU-Pro vs. 87.1% on MMLU, highlighting the increased difficulty.
  • Claude 3.5 Sonnet achieved 69.1% on MMLU-Pro, compared to 88.4% on original MMLU.
  • Llama 4 405B (2025) scored ~65% on MMLU-Pro, revealing a gap behind GPT-4o and Gemini 2.0.
  • DeepSeek-R1 (2025) reported 74.5% on MMLU-Pro, using chain-of-thought reasoning.
  • GPQA (Graduate-Level Q&A) is often used alongside MMLU-Pro to measure expert-level reasoning.

FAQ

What is MMLU-Pro?

MMLU-Pro is an expanded, harder version of the Massive Multitask Language Understanding (MMLU) benchmark, designed to reduce ceiling effects by adding more challenging questions, increasing answer choices from 4 to 10, and removing noisy or trivial items.
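
A back-of-envelope calculation (our illustration, not from the benchmark paper) shows why the wider option set helps. If a model genuinely knows a fraction p of the items and guesses uniformly among K choices on the rest, its expected accuracy is

    E[accuracy] = p + (1 − p) / K

With K = 4, models that know 80% and 90% of the material are expected to score 85.0% and 92.5%, a 7.5-point gap; with K = 10 they score 82.0% and 91.0%, a 9.0-point gap, and both sit further below the ceiling.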

How does MMLU-Pro work?

Each MMLU-Pro item presents a question with up to 10 answer options labeled A–J. A model is prompted, typically with chain-of-thought, to reason about the question and commit to a single letter, and scoring is strict exact match against the gold letter with no partial credit. The larger option set lowers the random-guess baseline to 10% and makes shortcut strategies less effective.
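
To see the item format directly, the snippet below loads the public release from the Hugging Face Hub. The dataset id TIGER-Lab/MMLU-Pro and the field names (question, options, answer, category) reflect the release as of this writing and should be verified against the dataset card.

    # Sketch: inspect one MMLU-Pro item from the Hugging Face release.
    # Dataset id and field names are assumptions based on the dataset card.
    from datasets import load_dataset

    ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

    item = ds[0]
    print(item["category"])                     # subject area, e.g. "business"
    print(item["question"])
    for letter, option in zip("ABCDEFGHIJ", item["options"]):
        print(f"  {letter}. {option}")          # up to 10 answer choices
    print("gold answer:", item["answer"])       # gold letter in A-J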

Where is MMLU-Pro used in 2026?

As of 2026, MMLU-Pro appears in most frontier model release announcements (e.g., DeepSeek-R1, Qwen3) and on public leaderboards, usually reported alongside complementary benchmarks such as GPQA for expert-level reasoning and SWE-bench for coding. It is most informative for separating models whose scores on the original MMLU have saturated near the ceiling.