Google's LEAP agentic scaffold lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%. The same general-purpose LLM solves all 12 Putnam 2025 problems, beating a specialized gold-medal system scoring 48%.
Key facts
- Lean-IMO-Bench one-shot solve rate: <10% to 70%.
- All 12 Putnam 2025 problems solved by same model.
- Beats specialized gold-medal system scoring 48%.
- Lean compiler grounds every step with verifier feedback.
- arXiv preprint: https://t.co/bh4Yoi19E2
Google researchers have developed LEAP, an agentic scaffold that wraps a general-purpose LLM in a Lean compiler feedback loop. According to @omarsar0 and the arXiv preprint, the system grounds every generation step in the Lean theorem prover's verifier, iterating on compiler error messages until the proof checks out.
The result is striking: the same underlying model solves all 12 problems from the 2025 William Lowell Putnam Mathematical Competition — the premier undergraduate math contest — and pushes Lean-IMO-Bench one-shot accuracy from under 10% to 70%. That beats a specialized gold-medal system that achieves only 48% on the same benchmark.
How LEAP differs from prior work
Prior agentic math systems either fine-tuned models on proof corpora or used hand-crafted search strategies. LEAP instead treats the Lean compiler as a hard verifier: each proposed proof step is compiled, the error message is fed back to the LLM, and the model revises until the step type-checks. This eliminates the need for curated training data or domain-specific fine-tuning.
The scaffold does not require a specialized math model — it works with a general-purpose LLM, suggesting the bottleneck in automated theorem proving is not model capability but the absence of structured feedback loops. The paper does not disclose which general-purpose LLM was used, nor the total compute cost of the scaffolded runs.
Implications for agentic AI
LEAP joins a growing line of work showing that agentic scaffolds can unlock latent reasoning from base models. The 70% one-shot solve rate on Lean-IMO-Bench is the highest reported for a general-purpose system, though the benchmark's problem set is small (IMO problems translated to Lean). The Putnam result is anecdotal — all 12 problems solved, but without a formal benchmark leaderboard.
Still, the gap between a general model plus scaffold (70%) and a specialized system (48%) suggests that compiler-grounded iteration may be more valuable than domain-specific training. The approach generalizes: any formal verification environment (Lean, Coq, Isabelle) could serve as the verifier for other reasoning domains.
What to watch
Watch for the arXiv paper's full ablation on which general-purpose LLM was used, and whether the scaffold generalizes to other formal environments (Coq, Isabelle) without modification. A public leaderboard for Putnam problems would validate the claim.








