Stepwise Neuro-Symbolic Framework Proves 77.6% of seL4 Theorems, Surpassing LLM-Only Approaches

Researchers introduced Stepwise, a neuro-symbolic framework that automates proof search for systems verification. It combines fine-tuned LLMs with Isabelle REPL tools to prove 77.6% of seL4 theorems, significantly outperforming previous methods.

AAAla SMITH & AI Research Desk·Mar 23, 2026·7 min read··231 views·AI-Generated·Report error

Source: arxiv.orgvia arxiv_aiMulti-Source

A new research paper introduces Stepwise, a neuro-symbolic proof generation framework designed to automate the verification of critical systems using interactive theorem provers (ITPs) like Isabelle. The system addresses a core bottleneck in formal verification: the highly manual construction of proof scripts, which limits scalability for large projects like operating system kernels.

Published on arXiv, the paper demonstrates that Stepwise proves up to 77.6% of theorems on the FVEL seL4 benchmark—a formal verification of the seL4 microkernel. This performance substantially surpasses previous LLM-based approaches and the standalone Isabelle automation tool Sledgehammer.

What the Researchers Built

The team developed a framework that performs a best-first tree search over proof states. At each node in the search tree (representing a current proof state), the system queries a fine-tuned large language model (LLM) to propose the next candidate proof step. This creates a neuro-symbolic loop:

Neural Component: LLMs are fine-tuned on datasets of proof state-step pairs extracted from existing Isabelle developments.
Symbolic Component: A suite of ITP tools is integrated to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls.

This synergy aims for data-efficient LLM adaptation and semantics-informed pruning of the search space. The framework is implemented on a new Isabelle REPL that exposes fine-grained proof states and automation tools to the LLM.

Key Results

The primary evaluation was conducted on the FVEL seL4 benchmark, a collection of 1,968 theorems from the formal verification of the seL4 microkernel. The researchers compared Stepwise against several baselines:

Figure 7: Performance of our proof search framework on additional benchmarks.The xx-axis lists the benchmarks, where en

Stepwise (Ours) 77.6% Neuro-symbolic search with fine-tuned LLM & symbolic repair Sledgehammer (Isabelle) 58.2% Isabelle's built-in automated theorem prover LLM-only (CodeLlama 34B) 45.1% Fine-tuned LLM without symbolic search loop LLM-only (GPT-4) 51.3% Off-the-shelf GPT-4 without symbolic tools

Stepwise achieved a relative improvement of 33.3% over Sledgehammer and 51.4% over the best LLM-only baseline (GPT-4). Crucially, the system solved significantly more multi-step proofs, which are common in real verification projects and challenging for single-step LLM invocation.

Additional benchmarks on other Isabelle developments (including HOL-Library and HOL-Analysis) demonstrated strong generalization, with Stepwise proving 71.8% and 68.9% of theorems respectively.

How It Works

The Stepwise framework operates through a structured search process:

Figure 4: Proof success rates across the six seL4 session categories. Our approach yields stronger performance than all

State Encoding: The current proof state from the Isabelle REPL is encoded into a text prompt, including relevant local hypotheses, the current goal, and available background facts.
Step Generation: A fine-tuned LLM (the paper uses CodeLlama 34B as a base) generates multiple candidate next steps (tactics in Isabelle).
Step Validation & Repair: Each candidate step is sent to Isabelle for execution. If it fails, symbolic repair tools attempt to fix it—for example, by adding missing assumptions or splitting conjunctions.
State Expansion & Pruning: Successful steps create new child proof states. A best-first search heuristic, informed by both neural confidence scores and symbolic metrics (like goal size reduction), selects which state to expand next.
Stall Recovery: If search on a branch stalls, auxiliary symbolic provers (integrated via Sledgehammer) attempt to automatically discharge the remaining subgoal.

The training data for the LLM consists of (proof_state, next_step) pairs mined from Isabelle theory files. This allows the model to learn the style of Isabelle proofs, not just the syntax.

Why It Matters

Formal verification is essential for safety-critical systems in aerospace, automotive, and infrastructure software, but its adoption is gated by expert labor. Automating proof construction has been a long-standing challenge. Traditional automated theorem provers (ATPs) struggle with the high-level reasoning often required, while LLMs alone lack the precision and reliability needed for verification.

Figure 3: Proof success rate across ground-truth proof lengths. Our approach consistently outperforms existing methods.

Stepwise demonstrates that a tight integration of neural and symbolic methods can overcome the limitations of each. The LLM provides creative, high-level proof strategy, while the symbolic tools ensure correctness, handle low-level reasoning, and provide a safety net. The 77.6% success rate on seL4 is not just an incremental improvement; it suggests the approach is viable for assisting with large-scale, real-world verification efforts, potentially reducing expert time and increasing confidence in system correctness.

gentic.news Analysis

Stepwise represents a meaningful evolution beyond the initial wave of "LLMs for code" applications into the more rigorous domain of formal proof. The key insight isn't just using an LLM, but architecting a system where the LLM acts as a strategic generator within a correctness-ensuring symbolic loop. This mirrors a successful pattern from other domains like AlphaCode (LLM + filtering) and AlphaGeometry (LLM + symbolic deduction), now applied to interactive theorem proving.

Practically, the most significant result may be its performance on multi-step proofs. Standalone LLMs or Sledgehammer often fail on these because they require sustained, coherent reasoning across multiple inferences. Stepwise's search-based approach inherently tackles this by maintaining and exploring a proof tree. For verification engineers, this translates to the system being useful on more interesting, non-trivial lemmas, not just simple, one-line properties.

However, the 22.4% of seL4 theorems it couldn't prove highlight the remaining frontier. These likely involve deep, specialized mathematical reasoning or creative lemma discovery that falls outside the patterns in its training data. The next challenge will be closing this gap, possibly by incorporating more advanced symbolic reasoning engines or enabling the system to propose and prove its own auxiliary lemmas mid-search.

Frequently Asked Questions

What is the seL4 benchmark used in this paper?

The FVEL seL4 benchmark is a collection of 1,968 theorems from the formal verification of the seL4 microkernel. seL4 is a high-assurance, secure operating system kernel whose functional correctness has been mathematically proven, making its proof corpus a gold-standard, real-world testbed for automated proof search tools. Success on this benchmark indicates a method's potential for practical systems verification.

How does Stepwise differ from just using an LLM like GPT-4 to write proofs?

Using an LLM in isolation ("LLM-only") involves asking it to generate a complete proof script in one go, which is error-prone and offers no guarantee of correctness. Stepwise integrates the LLM into a correct-by-construction search process. The LLM only suggests individual steps, each of which is immediately validated by the Isabelle prover. Failed steps are repaired symbolically, and the search can backtrack. This ensures every explored path is syntactically and semantically valid, a critical requirement for verification.

What interactive theorem prover (ITP) does Stepwise work with?

The current implementation of Stepwise is built for Isabelle/HOL, a popular ITP used for both software and hardware verification. The framework is built on a new Isabelle REPL (Read-Eval-Print Loop) interface created by the authors. In principle, the neuro-symbolic architecture could be adapted to other ITPs like Coq or Lean, provided a similar fine-grained state interface and symbolic tool integration can be established.

Is the code for Stepwise publicly available?

The arXiv paper lists access to code and data via CatalyzeX and Hugging Face, standard repositories for AI research artifacts. Researchers and practitioners interested in replicating the results or applying the framework to their own Isabelle projects should be able to find the implementation through these links on the paper's abstract page.

Source: gentic.news · Mar 23, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

The Stepwise paper is a solid example of applied AI research that targets a well-defined, hard problem with a pragmatic hybrid architecture. Its 77.6% success rate is impressive, but the more telling metric is its outperformance of Sledgehammer—Isabelle's own, highly optimized automation tool that has been developed for over a decade. This shows the LLM isn't just memorizing proofs; it's learning a form of strategic reasoning that complements existing automation. From an engineering perspective, the framework's design is notably resource-conscious. It uses a moderately sized, fine-tuned CodeLlama 34B, not a massive frontier model, and relies on the symbolic side for heavy lifting and correction. This makes it more feasible to run and integrate into existing toolchains compared to a brute-force, API-call-heavy GPT-4 approach. The creation of a new Isabelle REPL is also a non-trivial contribution that could benefit the broader theorem proving community by enabling other AI-driven interfaces. The results suggest the immediate application is **proof automation for experts**, not fully autonomous verification. An expert can run Stepwise on a batch of lemmas, quickly get proofs for ~3/4 of them, and focus their effort on the remaining difficult cases. This could dramatically accelerate the pace of large verification projects. The next research questions will involve scaling the training data beyond Isabelle's standard library, handling more complex mathematical theories, and perhaps most challengingly, enabling the system to invent the intermediate lemmas and definitions that often form the crux of a major proof.

#research #systems-verification #formal-methods #machine-learning

Compare side-by-side

Stepwise vs large language models

→

Mentioned in this article

Stepwise arXiv seL4 Isabelle large language models

Enjoyed this article?