
Humanity's Last Exam: definition + examples

Humanity's Last Exam (HLE) is a benchmark dataset released in early 2025 by a consortium of researchers including Dan Hendrycks and others from the Center for AI Safety (CAIS), Scale AI, and various universities. It consists of approximately 3,000 multiple-choice and free-response questions spanning mathematics, physics, chemistry, biology, computer science, history, and philosophy. Each question was authored by subject-matter experts (professors, postdocs, Olympiad medalists) and designed to be maximally difficult — requiring deep reasoning, multi-step problem solving, or specialized knowledge not easily found in training corpora. The exam explicitly excludes questions with solutions publicly available on the internet as of its creation date, making it a test of genuine generalization and reasoning rather than memorization or retrieval.

Technically, HLE is constructed through a rigorous vetting process. Proposed questions are peer-reviewed by other experts to ensure correctness, uniqueness, and difficulty. Each question is then formatted into a standardized JSON schema with a question body, answer choices (for multiple-choice), a correct answer, and an explanation. The dataset is split into a public validation set (used for model development) and a private test set (used for official evaluation). To prevent contamination, the private test set is never released; evaluations are performed by the consortium upon request. Models are evaluated under zero-shot conditions (no fine-tuning on HLE) using chain-of-thought prompting or tool-use (e.g., Python interpreters for calculation). Performance is measured by accuracy on multiple-choice questions and by exact-match or rubric-based scoring on free-response questions.
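
To make the standardized format and scoring protocol above concrete, here is a minimal sketch in Python. The field names and helper functions are illustrative assumptions, not the consortium's published schema or grader; the sketch only shows the shape of an item and exact-match scoring for the multiple-choice/free-response split described in this section.

```python
# Illustrative record in the style of the standardized schema described above.
# All field names are hypothetical; the official HLE schema may differ.
example_item = {
    "id": "hle-000123",
    "subject": "mathematics",
    "type": "multiple_choice",       # or "free_response"
    "question": "Which of the following ... ?",
    "choices": ["A", "B", "C", "D"],
    "answer": "C",
    "explanation": "Expert-written rationale used during review.",
}

def score_item(item: dict, model_answer: str) -> float:
    """Exact-match scoring: accuracy for multiple choice, exact match for short
    free-response answers. Rubric-based grading of long answers is not modeled here."""
    return 1.0 if model_answer.strip().lower() == item["answer"].strip().lower() else 0.0

def evaluate(items: list, model_answers: dict) -> float:
    """Mean score over the items, one zero-shot answer per question."""
    scores = [score_item(it, model_answers.get(it["id"], "")) for it in items]
    return sum(scores) / len(scores) if scores else 0.0

print(evaluate([example_item], {"hle-000123": "C"}))  # -> 1.0
```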

Why HLE matters: As AI systems approach or exceed human performance on existing benchmarks (MMLU, GSM8K, MATH, GPQA), there is an urgent need for harder tests that can differentiate between top models and measure progress toward artificial general intelligence (AGI). HLE aims to be the "final exam": a ceiling benchmark intended to remain unsaturated for years. It is used to identify remaining weaknesses in reasoning, world knowledge, and cross-domain integration. For example, early results (2025) showed that even GPT-4o and Claude 3.5 Sonnet scored below 10% on HLE, while Gemini Ultra 2.0 reportedly achieved ~15% and a specialized system such as OpenAI's o3 reached ~25% with extended test-time compute. These scores highlight that HLE is far from saturated, unlike MMLU, where many models exceed 90%.

When to use HLE vs alternatives: HLE is appropriate for evaluating frontier models (e.g., GPT-5, Claude 4, Gemini 3) when the goal is to probe the upper bounds of reasoning and knowledge. For routine model evaluation or fine-tuning, simpler benchmarks like MMLU-Pro, GPQA, or MATH are more practical. HLE is not suitable for rapid iteration or small models due to its difficulty and cost (each evaluation requires significant compute for long chain-of-thought generations).

Common pitfalls:

  • Contamination — models may have memorized questions if they were inadvertently included in training data; the private test set mitigates this.
  • Overfitting to the public validation set — tuning prompts on the public set can inflate scores on the private set.
  • Misinterpreting low scores — a low score does not necessarily imply a model is useless; HLE is intentionally designed to be near-impossible.
  • Equating HLE with AGI — high performance on HLE does not guarantee general intelligence; it is only one narrow measure.
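
For the contamination pitfall above, one simple (and admittedly coarse) screen is to measure word n-gram overlap between a benchmark question and the training corpus. The sketch below is a minimal illustration under the assumption of plain-text access to both sides; the function names, the 8-gram choice, and the 0.5 threshold are arbitrary examples, not part of HLE's actual vetting process.

```python
def word_ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams; 8-grams are a common heuristic for near-verbatim overlap."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(question: str, corpus_chunks: list, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus.

    A high value flags possible verbatim contamination; it does not prove
    memorization, and paraphrased leakage will be missed entirely.
    """
    q = word_ngrams(question, n)
    if not q:
        return 0.0
    corpus_grams = set()
    for chunk in corpus_chunks:
        corpus_grams |= word_ngrams(chunk, n)
    return len(q & corpus_grams) / len(q)

# Hypothetical usage: flag heavily overlapping items for manual review.
question = "Determine the exact energy spectrum of the modified Hamiltonian described below ..."
if overlap_fraction(question, ["... a shard of pretraining text ..."]) > 0.5:
    print("possible contamination -- send this item for review")
```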

Current state of the art (2026): As of 2026, the highest reported score on HLE is approximately 35%, achieved by a large language model (e.g., GPT-5 with 1.8T parameters) augmented with a formal theorem prover (Lean 4) and a symbolic algebra system (Wolfram Alpha). No model has exceeded 40%. The benchmark remains a key differentiator in frontier model releases, and new variants (HLE-2026) with updated questions are under development to stay ahead of model progress.
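
The tool-augmented setup described above can be pictured as a routing loop: the language model drafts an answer, and questions that call for formal proof or heavy symbolic computation are delegated to an external checker. The sketch below is a loose illustration only; the keyword routing rule and the stubbed tool calls are assumptions, not the actual pipeline behind the reported score.

```python
def route(question: str) -> str:
    """Crude keyword routing: proofs go to a theorem prover, symbolic math to a
    computer algebra system, everything else stays with the language model alone."""
    q = question.lower()
    if "prove" in q or "conjecture" in q:
        return "theorem_prover"
    if "integral" in q or "spectrum" in q or "simplify" in q:
        return "computer_algebra"
    return "llm_only"

def check_with_tool(tool: str, draft: str) -> str:
    """Stub for an external verification call (e.g., a Lean 4 proof check or a
    CAS simplification). A real system would invoke the tool and parse its output."""
    return f"[{tool} check requested for draft of length {len(draft)}]"

def answer(question: str, llm_draft: str) -> str:
    """Combine the model's draft with a tool check when one applies."""
    tool = route(question)
    if tool == "llm_only":
        return llm_draft
    return llm_draft + " " + check_with_tool(tool, llm_draft)

print(answer("Prove the conjecture about prime gaps stated above.", "Proof sketch ..."))
```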

Examples

  • A question on the 2025 HLE required proving a conjecture in number theory about prime gaps, solvable only by combining analytic number theory with a novel combinatorial lemma — no known proof existed in the training data.
  • OpenAI's o3 model achieved ~25% on HLE using 100x test-time compute for chain-of-thought reasoning, compared to GPT-4o's ~5%.
  • An HLE question in quantum mechanics asked for the exact energy spectrum of a modified hydrogen atom Hamiltonian, requiring symbolic integration and group theory.
  • The HLE public validation set includes a question about the 18th-century political philosophy of David Hume, requiring interpretation of primary sources not in typical training corpora.
  • Scale AI reported that as of mid-2026, only 12 models have been officially evaluated on HLE, with the top score held by a consortium model (Gemini 3 + AlphaProof).

Related terms

MMLU, GPQA, BIG-bench, Contamination, Chain-of-Thought


FAQ

What is Humanity's Last Exam?

Humanity's Last Exam is a 2025 benchmark of ~3,000 expert-crafted questions across STEM and the humanities, designed to be the hardest test for AI and built so that solutions are not publicly available; it is used to measure frontier model capabilities.

How does Humanity's Last Exam work?

Questions are authored by subject-matter experts, peer-reviewed for correctness, uniqueness, and difficulty, and formatted into a standardized schema. The dataset is split into a public validation set and a private test set held by the consortium to prevent contamination. Models are evaluated zero-shot, typically with chain-of-thought prompting or tool use, and scored by accuracy on multiple-choice questions and by exact-match or rubric-based grading on free-response questions.

Where is Humanity's Last Exam used in 2026?

As of 2026, HLE is used primarily to evaluate frontier models (e.g., GPT-5, Claude 4, Gemini 3) and remains a key differentiator in frontier model releases. Official evaluations are run by the consortium on the private test set; Scale AI reported that as of mid-2026 only 12 models had been officially evaluated, with the top score (~35%) held by a tool-augmented consortium system (Gemini 3 + AlphaProof). An updated variant, HLE-2026, is under development.