What is MA-ProofBench?

MA-ProofBench is a formal theorem-proving benchmark for mathematical analysis, containing 200 theorems at undergraduate and PhD levels.

How did GPT-5.5 perform on MA-ProofBench?

GPT-5.5 scored 16% Pass@8 on Level I and 5% on Level II, the best among models tested.

What are the main failure modes identified?

Mathlib hallucinations (incorrect Lean code) and incomplete proofs are the two dominant failure modes.

Bar chart comparing AI model scores on MA-ProofBench, with GPT-5.5 reaching 16% on undergraduate and 5% on PhD…

MA-ProofBench: GPT-5.5 Hits 16% on Math Analysis, Most Models Near 0%

MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, shows GPT-5.5 achieving 16% on undergraduate problems and 5% on PhD-level, with most models near 0% on the harder set.

Ala SMITH & AI Research Desk · Jun 15, 2026 · 3 min read · AI-Generated

Source: arxiv.org via arxiv_ai · Multi-Source

How well do LLMs perform on the new MA-ProofBench theorem-proving benchmark for mathematical analysis?

On MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, GPT-5.5 scored 16% Pass@8 on undergraduate-level problems and 5% on PhD-level, with most models near 0% on the harder set.

TL;DR

New benchmark tests theorem proving in math analysis. · GPT-5.5 achieves only 16% Pass@8 on Level I. · Most models score near 0% on PhD-level problems.

GPT-5.5 scored 16% Pass@8 on MA-ProofBench's undergraduate-level theorem-proving problems, and 5% on PhD-level. Most models tested barely registered above 0% on the harder set, per the June 2026 arXiv preprint.

Key facts

GPT-5.5 achieved 16% Pass@8 on Level I, 5% on Level II.
Most models scored near 0% on Level II PhD problems.
Benchmark has 200 theorems across 6 core topics.
Two dominant failure modes: Mathlib hallucinations and incomplete proofs.
Natural-language version shows a clear informal-formal reasoning gap.

According to the arXiv preprint, MA-ProofBench is the first formal theorem-proving benchmark dedicated to mathematical analysis. The benchmark comprises 200 theorems across 6 core topics and 27 subcategories, including measure theory, complex analysis, and functional analysis. Problems are split into two difficulty tiers: Level I (undergraduate, 100 problems) and Level II (PhD qualifying, 100 problems).

Results: Near-zero on advanced reasoning

On Level I, GPT-5.5 achieved 16% Pass@8, while most other general-purpose reasoning models and formal theorem provers scored below 10%. On Level II, GPT-5.5 dropped to 5%, and the majority of models stayed close to 0%. The authors note that existing formal benchmarks concentrate on easier-to-formalize areas like algebra and elementary number theory, leaving a gap in advanced domains requiring deeper reasoning.
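For readers unfamiliar with the metric, Pass@8 is usually read as the share of problems for which at least one of 8 sampled proofs is accepted. The article does not say how the paper computes it, so the sketch below assumes the commonly used unbiased pass@k estimator; the function names and the per-problem counts are hypothetical, constructed only to reproduce the reported 16% on a 100-problem tier.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled proofs, c of them accepted.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k), not necessarily the paper's exact procedure.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one accepted proof.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical Level I outcome: 8 attempts per problem, 16 of 100 problems
# with at least one accepted proof, 84 with none.
level_one = [(8, 1)] * 16 + [(8, 0)] * 84
score = benchmark_pass_at_k(level_one, k=8)
print(f"Pass@8 = {score:.0%}")
```

When exactly 8 proofs are sampled per problem, the estimator reduces to a simple solved-or-not count, so under this reading a 16% score on Level I would mean 16 of the 100 theorems had at least one accepted proof.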

Failure modes: Hallucination and incompleteness

The paper identifies two dominant failure modes: Mathlib hallucinations (models generating plausible-looking but incorrect Lean code referencing non-existent library entities) and incomplete proofs (models starting correctly but failing to finish). An evaluation on natural-language versions of the same problems revealed a clear gap between informal and formal reasoning — models performed significantly better when not constrained by formal syntax.
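To make the two failure modes concrete, here is a small Lean 4 sketch. It is an illustration written for this article, not an example taken from the paper: the theorems are deliberately trivial, and the lemma name `Real.add_is_commutative` is invented here to stand in for a hallucinated library entity.

```lean
import Mathlib

-- Accepted: `add_comm` is a real lemma, so Lean can check this proof.
example (a b : ℝ) : a + b = b + a := add_comm a b

-- Failure mode 1 (hallucinated library entity): the lemma name below is made up
-- for illustration. If the line were uncommented, Lean would reject it because
-- no such identifier exists.
-- example (a b : ℝ) : a + b = b + a := Real.add_is_commutative a b

-- Failure mode 2 (incomplete proof): the proof opens with a valid step but
-- never closes the goal. `sorry` lets the file compile with a warning, yet the
-- theorem remains unproved.
example (f : ℝ → ℝ) (hf : Continuous f) : Continuous (fun x => f x + 1) := by
  have h1 : Continuous (fun _ : ℝ => (1 : ℝ)) := continuous_const
  sorry
```

In both failing cases the output can look reasonable to a human reader, which is why a formal checker is needed to tell a plausible proof from a verified one.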

Implications for AI reasoning

MA-ProofBench exposes a stark ceiling on current LLMs' ability to perform rigorous formal reasoning in advanced mathematics. The near-zero performance on Level II suggests that today's models, including frontier systems like GPT-5.5, lack the depth to handle PhD-level formal proofs. The benchmark is intended as a reference for tracking progress, but the current results indicate that formal theorem proving in analysis remains largely unsolved.

What to watch

Watch for results from future model releases on MA-ProofBench, especially from OpenAI and Anthropic. The benchmark's public leaderboard will show whether next-generation reasoning models can crack the 20% barrier on Level II, or whether architectural changes are needed to handle formal analysis.

Figure 3: Overview of the curation workflow of MA-ProofBench, comprising Problem Collection, Formalization, Independent

Source: arxiv.org

Source: gentic.news · Jun 15, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

MA-ProofBench fills a gap in formal reasoning evaluation by targeting mathematical analysis, a domain underrepresented in prior benchmarks such as MiniF2F or ProofNet. The near-zero performance on Level II underscores a fundamental limitation: current LLMs, including GPT-5.5, cannot reliably synthesize formal proofs that require multi-step reasoning over abstract concepts like measure theory or functional analysis. The identification of Mathlib hallucinations as a failure mode is particularly telling. It suggests models are reproducing patterns from training data without a grounded understanding of Lean's library structure. This contrasts with gains on informal reasoning benchmarks, where models like GPT-5.5 have shown steady improvement. The gap between natural-language and formal performance suggests that the formal syntax layer, more than the underlying math, is the primary bottleneck. The benchmark's design, with human-led formalization and expert review, sets a high bar for reliability, making it a credible stress test for future models.
