Google LEAP Scaffold Lifts Lean-IMO-Bench One-Shot Solve Rate from <10% to 70%

Google's LEAP scaffold lifts Lean-IMO-Bench one-shot solve rate from <10% to 70%, solving all 12 Putnam 2025 problems.

Ala SMITH & AI Research Desk · Jun 3, 2026 · 3 min read · AI-Generated

Source: x.com via @omarsar0 (single source)

How did Google's LEAP scaffold improve automated math reasoning performance?

Google's LEAP agent scaffold wraps a general-purpose LLM in a Lean compiler feedback loop. It lifts the Lean-IMO-Bench one-shot solve rate from under 10% to 70%, ahead of a specialized gold-medal system that scores 48% on the same benchmark, and the same model solves all 12 Putnam 2025 problems.

TL;DR

LEAP wraps LLM in Lean compiler scaffold. · Google's system solves all 12 Putnam 2025 problems. · One-shot solve rate jumps from under 10% to 70%.

Google's LEAP agentic scaffold lifts the Lean-IMO-Bench one-shot solve rate from under 10% to 70%, ahead of a specialized gold-medal system that scores 48% on the same benchmark. The same general-purpose LLM also solves all 12 Putnam 2025 problems.

Key facts

Lean-IMO-Bench one-shot solve rate: <10% to 70%.
All 12 Putnam 2025 problems solved by same model.
Beats specialized gold-medal system scoring 48% on Lean-IMO-Bench.
Lean compiler grounds every step with verifier feedback.
arXiv preprint: https://t.co/bh4Yoi19E2

Google researchers have developed LEAP, an agentic scaffold that wraps a general-purpose LLM in a Lean compiler feedback loop. According to @omarsar0 and the arXiv preprint, the system grounds every generation step in the Lean theorem prover's verifier, iterating on compiler error messages until the proof checks out.

The result is striking: the same underlying model solves all 12 problems from the 2025 William Lowell Putnam Mathematical Competition — the premier undergraduate math contest — and pushes Lean-IMO-Bench one-shot accuracy from under 10% to 70%. That beats a specialized gold-medal system that achieves only 48% on the same benchmark.

How LEAP differs from prior work

Prior agentic math systems either fine-tuned models on proof corpora or used hand-crafted search strategies. LEAP instead treats the Lean compiler as a hard verifier: each proposed proof step is compiled, the error message is fed back to the LLM, and the model revises until the step type-checks. This eliminates the need for curated training data or domain-specific fine-tuning.
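The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LEAP's implementation: the names `refine_until_verified`, `CheckResult`, `toy_propose`, and `toy_verify` are hypothetical, and the toy functions stand in for the LLM and the Lean compiler so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    """Outcome of one verifier call (hypothetical shape, for illustration)."""
    ok: bool
    error: str = ""


def refine_until_verified(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], CheckResult],
    max_rounds: int = 8,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, check it, and feed any error back until it passes."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)   # model sees the previous error, if any
        result = verify(candidate)      # hard accept/reject from the verifier
        if result.ok:
            return candidate, round_no
        feedback = result.error
    return None, max_rounds             # budget exhausted without a verified proof


# Toy stand-ins so the sketch runs without Lean or an LLM (assumptions).
def toy_verify(candidate: str) -> CheckResult:
    if candidate == "Nat.add_comm a b":
        return CheckResult(ok=True)
    return CheckResult(ok=False, error=f"toy checker rejected: {candidate}")


def toy_propose(feedback: Optional[str]) -> str:
    # The first attempt has no feedback; later attempts react to the error text.
    return "rfl" if feedback is None else "Nat.add_comm a b"


proof, rounds = refine_until_verified(toy_propose, toy_verify)
print(proof, rounds)
```

In a real setup, `verify` would invoke the Lean compiler on the candidate and return its error output, and `propose` would prompt the LLM with that output. The round budget is a design choice of this sketch; the source does not say how LEAP bounds its iterations.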

The scaffold does not require a specialized math model — it works with a general-purpose LLM, suggesting the bottleneck in automated theorem proving is not model capability but the absence of structured feedback loops. The paper does not disclose which general-purpose LLM was used, nor the total compute cost of the scaffolded runs.

Implications for agentic AI

LEAP joins a growing line of work showing that agentic scaffolds can unlock latent reasoning from base models. The 70% one-shot solve rate on Lean-IMO-Bench is the highest reported for a general-purpose system, though the benchmark's problem set is small (IMO problems translated to Lean). The Putnam result is anecdotal — all 12 problems solved, but without a formal benchmark leaderboard.

Still, the gap between a general model plus scaffold (70%) and a specialized system (48%) suggests that compiler-grounded iteration may be more valuable than domain-specific training. The approach may also generalize: any formal verification environment (Lean, Coq, Isabelle) could in principle serve as the verifier for other reasoning domains, though no results outside Lean are reported.

What to watch

Watch for disclosure of which general-purpose LLM was used and what the scaffolded runs cost in compute, and for evidence that the scaffold generalizes to other formal environments (Coq, Isabelle) without modification. A public leaderboard for Putnam problems would help validate the Putnam claim.

Source: gentic.news · Jun 3, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

LEAP's core insight, that compiler-grounded iteration can unlock latent reasoning from a general-purpose LLM, challenges the prevailing orthodoxy that specialized fine-tuning is necessary for hard math domains. The 70% one-shot solve rate on Lean-IMO-Bench, versus 48% for a gold-medal system, suggests the bottleneck is not model capability but the absence of structured feedback loops. This is consistent with recent work on agentic scaffolds (e.g., Reflexion, Self-Debug), but LEAP applies the principle to a formal verification setting where the verifier is deterministic and exhaustive. The result is a clean ablation: same model, different scaffold, and at least a 7x improvement (from under 10% to 70%). The Putnam claim of all 12 problems solved is striking but lacks a standard benchmark; the paper would benefit from a public leaderboard.

The approach has limitations. It requires a formal specification of the problem, which is labor-intensive to produce. The paper does not disclose the total compute cost, making it hard to compare efficiency. Still, LEAP is important evidence that agentic scaffolds, not bigger models, may be the path to reliable automated reasoning.
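For a sense of what a formal specification involves, here is a toy Lean 4 statement with a proof the compiler accepts. It is an illustrative example only, far simpler than any competition problem and not taken from the paper; the theorem name `add_comm_sketch` is made up for this sketch.

```lean
-- Formal statement of a toy claim: addition of natural numbers commutes.
-- An attempt such as `:= by rfl` would be rejected here, because the two
-- sides are not equal by definitional unfolding alone.
theorem add_comm_sketch (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Every hypothesis and the exact conclusion must be written out in this form before any proof search can start, which is where the labor cost comes from.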

#lean #automated reasoning #google #ai research
