MMLU-Pro (Massive Multitask Language Understanding – Professional) is a benchmark introduced in 2024 by the TIGER-Lab group at the University of Waterloo and collaborators to address saturation in the original MMLU benchmark. As large language models (LLMs) approached or exceeded 90% accuracy on MMLU, the benchmark lost discriminative power, especially among top-performing models.

MMLU-Pro extends MMLU in three ways: (1) it increases the number of answer choices per question from 4 to 10, reducing the random-guess baseline from 25% to 10%; (2) it replaces trivial or ambiguous questions with more complex, reasoning-heavy items in professional domains such as law, medicine, and engineering; and (3) it filters out questions that can be solved by simple pattern matching or memorization. The dataset contains roughly 12,000 questions across 14 disciplines, emphasizing multi-step problem solving and domain-specific knowledge.

Evaluation uses exact-match accuracy: the final answer letter (A–J) is extracted from the model's response, typically produced with chain-of-thought prompting, and compared against the gold letter. No partial credit is given.

MMLU-Pro has become a de facto standard for assessing frontier LLMs in 2025–2026. For example, GPT-4o achieves ~72% on MMLU-Pro versus ~87% on MMLU, while Claude 3.5 Sonnet scores ~69% versus ~88%, illustrating the increased difficulty. The benchmark is particularly useful for differentiating models that appear equally capable on MMLU, such as Gemini 2.0 Pro and Llama 4 405B.

Common pitfalls include overfitting to the exact question format (e.g., memorizing answer patterns), neglecting calibration (MMLU-Pro does not measure confidence), and assuming that a high score implies general intelligence: MMLU-Pro remains a multiple-choice test and does not evaluate open-ended generation, reasoning chains, or safety.

As of 2026, MMLU-Pro is widely used alongside newer benchmarks such as GPQA (graduate-level Q&A) and SWE-bench for code, and remains a key filter in model release announcements (e.g., DeepSeek-R1, Qwen3). Leading scores hover around 75–78% (e.g., GPT-5, Gemini 2.5 Ultra), leaving significant headroom for improvement.
Examples
- GPT-4o scored 72.3% on MMLU-Pro vs. 87.1% on MMLU, highlighting the increased difficulty.
- Claude 3.5 Sonnet achieved 69.1% on MMLU-Pro, compared to 88.4% on original MMLU.
- Llama 4 405B (2025) scored ~65% on MMLU-Pro, revealing a gap behind GPT-4o and Gemini 2.0.
- DeepSeek-R1 (2025) reported 74.5% on MMLU-Pro, using chain-of-thought reasoning.
- GPQA (Graduate-Level Q&A) is often used alongside MMLU-Pro to measure expert-level reasoning.
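A 10-choice item of the kind scored in the examples above can be turned into a prompt roughly as follows. The field names and question wording are illustrative, not the official dataset schema:

```python
import string

def format_question(question: str, options: list[str]) -> str:
    """Render an MMLU-Pro-style item: question, choices A-J, answer cue."""
    if len(options) != 10:
        raise ValueError("MMLU-Pro items have 10 answer choices")
    lines = [question]
    # string.ascii_uppercase[:10] == "ABCDEFGHIJ"
    for letter, option in zip(string.ascii_uppercase[:10], options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter (A-J).")
    return "\n".join(lines)

# Hypothetical item for illustration only.
item = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "options": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon", "Helium",
                "Hydrogen", "Methane", "Neon", "Ozone", "Water vapor"],
}
print(format_question(item["question"], item["options"]))
```

With ten lettered options instead of four, a model guessing uniformly at random scores 10% rather than 25%, which is the baseline reduction noted in the definition above.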
Latest news mentioning MMLU-Pro
- DigitalOcean's Signal Sampling Finds Top Agent Trajectories Without LLM Cost (Apr 25, 2026)
  DigitalOcean's paper introduces lightweight behavioral signals to rank 80k agent-user trajectories, achieving 82% informativeness in sampled reviews compared to 54% for random sampling, with no LLM ov…
- Qwen 3.5 Small Models Defy Expectations, Outperforming Giants in Key AI Benchmarks (Mar 2, 2026)
  Alibaba's Qwen 3.5 small models (4B and 9B parameters) are reportedly outperforming much larger competitors like GPT-OSS-120B on several metrics. These compact models feature a 262K context window, ea…
FAQ
What is MMLU-Pro?
MMLU-Pro is an expanded, harder version of the Massive Multitask Language Understanding (MMLU) benchmark, designed to reduce ceiling effects by adding more challenging questions, increasing answer choices from 4 to 10, and removing noisy or trivial items.
How does MMLU-Pro work?
Each question presents 10 answer choices (A–J) instead of MMLU's 4, lowering the random-guess baseline from 25% to 10%. Models answer each question, typically with chain-of-thought prompting, and the extracted answer letter is scored by exact match against the gold letter, with no partial credit. The roughly 12,000 questions span 14 disciplines and are selected to require multi-step reasoning rather than pattern matching or memorization.
Where is MMLU-Pro used in 2026?
As of 2026, MMLU-Pro is a standard benchmark in model release announcements (e.g., DeepSeek-R1, Qwen3) and leaderboards, typically reported alongside GPQA for graduate-level reasoning and SWE-bench for coding. It is especially useful for separating frontier models that score near the ceiling on the original MMLU, as the GPT-4o (~72% vs. ~87%) and Claude 3.5 Sonnet (~69% vs. ~88%) gaps illustrate.