gentic.news — AI News Intelligence Platform

Bar chart comparing AI model scores on MA-ProofBench, with GPT-5.5 reaching 16% on undergraduate and 5% on PhD…
AI Research · Score: 82

MA-ProofBench: GPT-5.5 Hits 16% on Math Analysis, Most Models Near 0%

MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, shows GPT-5.5 achieving 16% on undergraduate problems and 5% on PhD-level, with most models near 0% on the harder set.

1d ago · 3 min read · AI-Generated
Source: arxiv.org via arxiv_ai (Multi-Source)
How well do LLMs perform on the new MA-ProofBench theorem-proving benchmark for mathematical analysis?

On MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, GPT-5.5 scored 16% Pass@8 on undergraduate-level problems and 5% on PhD-level, with most models near 0% on the harder set.

TL;DR

New benchmark tests theorem proving in math analysis. · GPT-5.5 achieves only 16% Pass@8 on Level I. · Most models score near 0% on PhD-level problems.

GPT-5.5 scored 16% Pass@8 on MA-ProofBench's undergraduate-level theorem-proving problems and 5% on its PhD-level problems. Most models tested barely registered above 0% on the harder set, per the June 2026 arXiv preprint.

Key facts

  • GPT-5.5 achieved 16% Pass@8 on Level I, 5% on Level II.
  • Most models scored near 0% on Level II PhD problems.
  • Benchmark has 200 theorems across 6 core topics.
  • Two dominant failure modes: Mathlib hallucinations and incomplete proofs.
  • Natural-language version shows a clear informal-formal reasoning gap.

Researchers have released MA-ProofBench, which the arXiv preprint describes as the first formal theorem-proving benchmark dedicated to mathematical analysis. The benchmark comprises 200 theorems across 6 core topics and 27 subcategories, including measure theory, complex analysis, and functional analysis. Problems are split into two difficulty tiers: Level I (undergraduate, 100 problems) and Level II (PhD qualifying, 100 problems).

Results: Near-zero on advanced reasoning

On Level I, GPT-5.5 achieved 16% Pass@8, while most other general-purpose reasoning models and formal theorem provers scored below 10%. On Level II, GPT-5.5 dropped to 5%, and the majority of models stayed close to 0%. The authors note that existing formal benchmarks concentrate on easier-to-formalize areas like algebra and elementary number theory, leaving a gap in advanced domains requiring deeper reasoning.
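
For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@k score is commonly computed. The article does not say how the authors calculate Pass@8, so this assumes the widely used unbiased estimator, which with exactly 8 attempts per problem reduces to "at least one of the 8 proofs was verified". The function names and the per-problem counts are hypothetical and are not data from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n attempts, c of them verified.

    Assumed definition (not confirmed by the article): the probability that
    at least one of k attempts drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem pass@k over (n_attempts, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Synthetic illustration only: 100 problems, 8 attempts each,
# 16 problems with one verified proof and 84 with none.
synthetic_level = [(8, 1)] * 16 + [(8, 0)] * 84
score = benchmark_pass_at_k(synthetic_level, k=8)
print(f"Pass@8 = {score:.0%}")
```

Under this assumed definition, a 16% Pass@8 on a 100-problem tier means 16 theorems were proved at least once in 8 attempts, which says nothing about how often any single attempt succeeds.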

Failure modes: Hallucination and incompleteness

The paper identifies two dominant failure modes: Mathlib hallucinations (models generating plausible-looking but incorrect Lean code referencing non-existent library entities) and incomplete proofs (models starting correctly but failing to finish). An evaluation on natural-language versions of the same problems revealed a clear gap between informal and formal reasoning — models performed significantly better when not constrained by formal syntax.
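
The two failure modes can be pictured with a toy Lean 4 sketch. This is an illustration under stated assumptions, not an example from the paper: the statements are deliberately trivial, and `Real.add_commutes` is a name invented here and assumed not to exist in Mathlib.

```lean
import Mathlib.Data.Real.Basic

-- Accepted: `add_comm` is a real lemma, so Lean can check this proof.
example (a b : ℝ) : a + b = b + a :=
  add_comm a b

-- Hallucination pattern: `Real.add_commutes` is invented for this sketch.
-- Uncommenting the line below would fail with an unknown-identifier error.
-- example (a b : ℝ) : a + b = b + a := Real.add_commutes a b

-- Incomplete-proof pattern: a valid first step, then the goal is abandoned.
-- Lean flags any declaration that relies on `sorry`, so a checker can reject it.
example (a b c : ℝ) : a + b + c = c + b + a := by
  rw [add_comm a b]
  sorry
```

In both failing cases the text looks like a proof, but the Lean checker either rejects it outright or marks it as unfinished, so it would not count toward a pass.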

Implications for AI reasoning

MA-ProofBench exposes a stark ceiling on current LLMs' ability to perform rigorous formal reasoning in advanced mathematics. The near-zero performance on Level II suggests that today's models, including frontier systems like GPT-5.5, lack the depth to handle PhD-level formal proofs. The benchmark is intended as a reference for tracking progress, but the current results indicate that formal theorem proving in analysis remains largely unsolved.

What to watch

Watch for future model releases on MA-ProofBench, especially from OpenAI and Anthropic. The benchmark's public leaderboard will reveal whether next-generation reasoning models can crack the 20% barrier on Level II, or if architectural changes are needed to handle formal analysis.

Figure 3: Overview of the curation workflow of MA-ProofBench, comprising Problem Collection, Formalization, Independent


Source: arxiv.org



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

MA-ProofBench fills a gap in formal reasoning evaluation by targeting mathematical analysis, a domain notably absent from prior benchmarks like MiniF2F or ProofNet. The near-zero performance on Level II underscores a fundamental limitation: current LLMs, including GPT-5.5, cannot reliably synthesize formal proofs requiring multi-step reasoning over abstract concepts like measure theory or functional analysis. The identification of Mathlib hallucinations as a failure mode is particularly telling: it suggests models are reproducing patterns from training data without a grounded understanding of Lean's library structure. This contrasts with gains in informal reasoning benchmarks, where models like GPT-5.5 have shown steady improvement. The gap between natural-language and formal performance suggests that expressing a proof in formal syntax, rather than the underlying mathematics alone, is a major bottleneck. The benchmark's design, with human-led formalization and expert review, sets a high bar for reliability, making it a credible stress test for future models.