gentic.news — AI News Intelligence Platform

Google LEAP Scaffold Lifts Lean-IMO-Bench One-Shot Solve Rate from <10% to 70%

Google's LEAP scaffold lifts Lean-IMO-Bench one-shot solve rate from <10% to 70%, solving all 12 Putnam 2025 problems.

4h ago · 3 min read · AI-Generated
How did Google's LEAP scaffold improve automated math reasoning performance?

Google's LEAP agent scaffold wraps a general-purpose LLM in Lean compiler feedback loops, solving all 12 Putnam 2025 problems and lifting Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a gold-medal system scoring 48%.

TL;DR

LEAP wraps LLM in Lean compiler scaffold. · Google's system solves all 12 Putnam 2025 problems. · One-shot solve rate jumps from under 10% to 70%.

Google's LEAP agentic scaffold lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%. The same general-purpose LLM solves all 12 Putnam 2025 problems, beating a specialized gold-medal system scoring 48%.

Key facts

  • Lean-IMO-Bench one-shot solve rate: <10% to 70%.
  • All 12 Putnam 2025 problems solved by same model.
  • Beats specialized gold-medal system scoring 48%.
  • Lean compiler grounds every step with verifier feedback.
  • arXiv preprint: https://t.co/bh4Yoi19E2

Google researchers have developed LEAP, an agentic scaffold that wraps a general-purpose LLM in a Lean compiler feedback loop. According to @omarsar0 and the arXiv preprint, the system grounds every generation step in the Lean theorem prover's verifier, iterating on compiler error messages until the proof checks out.
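
To make "the proof checks out" concrete, here is a toy Lean 4 illustration. It is not taken from the preprint; the theorem name and the choice of statement are assumptions made purely for illustration. It shows a first attempt the compiler would reject and a revised attempt that type-checks.

```lean
-- Toy statement, not from the paper: addition on natural numbers commutes.

-- A first attempt such as `by rfl` is rejected by Lean, because `a + b`
-- and `b + a` are not definitionally equal for arbitrary `a` and `b`:
--
--   theorem add_comm_toy (a b : Nat) : a + b = b + a := by rfl
--
-- A revised attempt that cites the core library lemma is accepted.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

In a compiler-grounded loop, the error from the rejected attempt would be the signal passed back to the model before it tries again.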

The result is striking: the same underlying model solves all 12 problems from the 2025 William Lowell Putnam Mathematical Competition — the premier undergraduate math contest — and pushes Lean-IMO-Bench one-shot accuracy from under 10% to 70%. That beats a specialized gold-medal system that achieves only 48% on the same benchmark.

How LEAP differs from prior work

Prior agentic math systems either fine-tuned models on proof corpora or used hand-crafted search strategies. LEAP instead treats the Lean compiler as a hard verifier: each proposed proof step is compiled, the error message is fed back to the LLM, and the model revises until the step type-checks. This eliminates the need for curated training data or domain-specific fine-tuning.
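
The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not LEAP's implementation: the names `refine_until_checked`, `Outcome`, `toy_verify`, and `toy_propose` are hypothetical, the verifier is a stand-in for a real Lean compiler call, and the proposer is a stand-in for an LLM. The attempt budget `max_attempts` is also an assumption; the article does not say how iteration is bounded.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types; none of these names come from the LEAP paper.
# A verifier returns an error message, or None when the proof checks.
Verifier = Callable[[str], Optional[str]]
# A proposer maps (goal, last error or None) to a candidate proof.
Proposer = Callable[[str, Optional[str]], str]


@dataclass
class Outcome:
    solved: bool
    proof: Optional[str]
    attempts: int


def refine_until_checked(goal: str, propose: Proposer, verify: Verifier,
                         max_attempts: int = 8) -> Outcome:
    """Propose a proof, feed verifier errors back, stop once it checks."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(goal, error)   # the model sees the last error
        error = verify(candidate)          # the compiler is the hard check
        if error is None:
            return Outcome(True, candidate, attempt)
    return Outcome(False, None, max_attempts)


# Toy stand-ins so the sketch runs without Lean or an LLM installed.
def toy_verify(candidate: str) -> Optional[str]:
    # Pretend compiler: accepts only the proof that cites the right lemma.
    if candidate == "Nat.add_comm a b":
        return None
    return "error: type mismatch"


def toy_propose(goal: str, error: Optional[str]) -> str:
    # Pretend model: the first guess is wrong, the revision is right.
    return "rfl" if error is None else "Nat.add_comm a b"


result = refine_until_checked("a + b = b + a", toy_propose, toy_verify)
print(result)
```

With the toy stand-ins, the first candidate is rejected and the second is accepted. Swapping `toy_verify` for a function that actually invokes a proof checker, and `toy_propose` for a model call, would leave the control flow unchanged.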

The scaffold does not require a specialized math model — it works with a general-purpose LLM, suggesting the bottleneck in automated theorem proving is not model capability but the absence of structured feedback loops. The paper does not disclose which general-purpose LLM was used, nor the total compute cost of the scaffolded runs.

Implications for agentic AI

LEAP joins a growing line of work showing that agentic scaffolds can unlock latent reasoning in general-purpose models. The 70% one-shot solve rate on Lean-IMO-Bench is the highest reported for a general-purpose system, though the benchmark's problem set is small (IMO problems translated to Lean). The Putnam result is anecdotal: all 12 problems were solved, but there is no formal benchmark leaderboard to compare against.

Still, the gap between a general model plus scaffold (70%) and a specialized system (48%), 22 percentage points, suggests that compiler-grounded iteration may be more valuable than domain-specific training. In principle the approach could extend beyond Lean: any formal verification environment (Coq or Isabelle, for example) could serve as the verifier for other reasoning domains, although no results outside Lean are reported.

What to watch

Watch for disclosure of which general-purpose LLM was used, for a full ablation, and for evidence that the scaffold generalizes to other formal environments (Coq, Isabelle) without modification. A public leaderboard for Putnam problems would help validate the claim.

Source: gentic.news

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

LEAP's core insight, that compiler-grounded iteration can unlock latent reasoning from a general-purpose LLM, challenges the prevailing orthodoxy that specialized fine-tuning is necessary for hard math domains. The 70% one-shot solve rate on Lean-IMO-Bench, versus 48% for a gold-medal system, suggests the bottleneck is not model capability but the absence of structured feedback loops.

This is consistent with recent work on agentic scaffolds (e.g., Reflexion, Self-Debug), but LEAP applies the principle to a formal verification setting where the verifier is deterministic and exhaustive. The result is a clean ablation: same model, different scaffold, and an improvement of more than 7x (from under 10% to 70%). The Putnam claim of all 12 problems solved is striking but lacks a standard benchmark; the paper would benefit from a public leaderboard.

The approach has limitations. It requires a formal specification of the problem, which is labor-intensive to produce. The paper does not disclose the total compute cost, making it hard to compare efficiency. Still, LEAP is important evidence that agentic scaffolds, not bigger models, may be the path to reliable automated reasoning.