
Learning to Disprove: LLMs Fine-Tuned for Formal Counterexample Generation in Lean 4

Researchers propose a method to train LLMs for formal counterexample generation, a neglected skill in mathematical AI. Their symbolic mutation strategy and multi-reward framework improve performance on three new benchmarks.

gentic.news Editorial·1d ago·7 min read·via arxiv_ai

Learning to Disprove: Formal Counterexample Generation with Large Language Models

A new research paper, "Learning to Disprove: Formal Counterexample Generation with Large Language Models," addresses a significant asymmetry in AI-assisted mathematical reasoning. While substantial effort has been directed toward training models to construct proofs, the complementary skill of finding counterexamples to disprove false statements has been largely overlooked. The authors formalize this task and introduce a training framework that enables large language models (LLMs) to not only propose counterexamples but also produce formal, machine-verifiable proofs of their validity using the Lean 4 theorem prover.

What the Researchers Built: A Framework for Disproof

The core contribution is a complete pipeline for training LLMs to perform formal counterexample generation. The task is defined as: given a false mathematical statement, the model must output both a candidate counterexample and a formal proof, written in Lean 4, that verifies the candidate indeed violates the statement. This moves beyond informal reasoning or simple guesswork, requiring the model to engage in rigorous, verifiable formal logic.

To tackle the scarcity of training data for this specific task, the team developed a symbolic mutation strategy. This data synthesis method operates on a corpus of true theorems (e.g., from mathlib, Lean's extensive mathematical library). It systematically extracts a theorem, selectively discards one or more of its hypotheses, and thereby creates a new, false statement. The original theorem's proof context often provides a direct counterexample to this mutated false statement. This process generates diverse, high-quality training pairs of (false statement, counterexample + formal proof).
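As a toy illustration of the mutation idea (the predicate, names, and proofs here are our own, not taken from the paper):

```lean
-- A toy predicate standing in for a mathlib-style definition.
def IsPos (n : Nat) : Prop := 0 < n

-- A true theorem: under the hypothesis n ≠ 0, n is positive.
theorem pos_of_ne_zero : ∀ n : Nat, n ≠ 0 → IsPos n
  | 0,          h => absurd rfl h
  | Nat.succ m, _ => Nat.succ_pos m

-- Symbolic mutation discards the hypothesis n ≠ 0, producing the
-- (now false) statement `∀ n, IsPos n`. The witness n = 0 -- the very
-- object that violates the dropped hypothesis -- disproves it:
example : ¬ ∀ n : Nat, IsPos n :=
  fun h => Nat.lt_irrefl 0 (h 0)
```

Note how the counterexample falls out of the discarded hypothesis: any object excluded by the original precondition is a natural candidate witness against the mutated statement.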

The Training Framework: Multi-Reward Expert Iteration

The synthesized and curated data fuels a multi-reward expert iteration training framework. This is not a simple supervised fine-tuning run. The framework uses multiple reward signals to guide the model's learning, likely including rewards for:

  1. Proposing a syntactically valid candidate counterexample.
  2. Generating a syntactically valid Lean 4 proof sketch.
  3. Producing a proof that successfully verifies in the Lean 4 kernel.

Table 2: Results of mutation ratio and average execution time, evaluated across multiple data sources.

Expert iteration—a process where a model generates solutions, an expert (or verifier) evaluates them, and the successful solutions are added back to the training set—is employed to iteratively improve the model's capability. The paper claims this framework "substantially enhances both the effectiveness and efficiency of training."
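The reward shaping and expert-iteration loop described above can be sketched as follows. The weights, check names, and loop structure are our own illustration; the paper's actual framework and hyperparameters are not specified in the source.

```python
# Hypothetical weights for the three reward signals listed above.
W_CANDIDATE, W_SKETCH, W_VERIFIED = 0.2, 0.3, 0.5

def score(attempt, checks):
    """Combine the multi-reward signals for one generated attempt.

    `checks` maps each signal name to a boolean predicate; in a real
    system these would be syntax checkers and the Lean 4 kernel.
    """
    r = 0.0
    if checks["valid_candidate"](attempt):
        r += W_CANDIDATE
    if checks["valid_sketch"](attempt):
        r += W_SKETCH
    if checks["kernel_verified"](attempt):
        r += W_VERIFIED
    return r

def expert_iteration(model_sample, statements, checks, rounds=3, samples=4):
    """Skeleton of expert iteration: sample, verify, keep the winners."""
    training_set = []
    for _ in range(rounds):
        for stmt in statements:
            attempts = [model_sample(stmt) for _ in range(samples)]
            # Only fully kernel-verified attempts re-enter training.
            accepted = [a for a in attempts if checks["kernel_verified"](a)]
            training_set.extend((stmt, a) for a in accepted)
        # A real system would fine-tune the model on `training_set` here
        # before the next round of sampling.
    return training_set
```

The dense shaping rewards (candidate and sketch validity) matter because kernel verification alone is an extremely sparse signal in this combinatorial output space.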

Key Results and Benchmarks

The methodology was evaluated on three newly collected benchmarks designed to test counterexample generation. While the source abstract does not provide specific numerical results, it states that experiments "validate the advantages of our approach, showing that the mutation strategy and training framework yield significant performance gains."

The critical comparison would be against baseline LLMs (like GPT-4, Claude, or open-weight models) prompted to perform the same task without specialized training, and potentially against models trained only for theorem proving. The implied result is that the proposed training pipeline leads to a marked improvement in the model's ability to correctly generate and formally verify counterexamples.

How It Works: From Theorem Mutation to Verified Disproof

  1. Data Synthesis via Symbolic Mutation:
    • Input: A formally proven theorem T in Lean: ∀ (x : X), P(x) → Q(x).
    • Mutation: Discard the hypothesis, creating a statement T': ∀ (x : X), Q(x). This is (in general) false, because Q(x) need not hold without P(x).
    • Counterexample Extraction: To disprove T', the pipeline must exhibit an object a together with a formal proof of ¬Q(a). Natural candidates are objects that violate the discarded hypothesis P; the original theorem and its proof context guide this search, and the process formalizes the extraction.

Figure 2: Task of formal counterexample generation.

  2. Model Training & Inference:
    • The model (architecture unspecified, but likely a decoder-only LLM such as LLaMA or CodeLlama) is trained on the mutated (false statement, counterexample + proof) pairs.
    • At inference, given a new false statement, the model generates a candidate counterexample and a Lean 4 proof.
    • The Lean 4 kernel acts as the verifier, providing a binary success/failure signal. A successful generation constitutes a formal disproof.
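The verification step can be sketched as a thin wrapper around the Lean 4 CLI. The file layout and obligation format below are illustrative assumptions; the paper does not specify its exact verification harness, and `lean` is assumed to be on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path

def disproof_obligation(statement: str, proof: str) -> str:
    """Wrap a claimed-false statement and a candidate disproof into a
    self-contained Lean 4 source file (illustrative layout)."""
    return f"example : ¬ ({statement}) :=\n{proof}\n"

def lean_accepts(source: str) -> bool:
    """Ask the Lean 4 kernel, via the `lean` CLI, whether the generated
    disproof type-checks. Exit code 0 means the proof was accepted."""
    with tempfile.TemporaryDirectory() as d:
        f = Path(d) / "Disproof.lean"
        f.write_text(source, encoding="utf-8")
        result = subprocess.run(["lean", str(f)], capture_output=True)
        return result.returncode == 0
```

Because acceptance is decided by the kernel's exit status, the signal is binary and trustworthy regardless of how the LLM produced the candidate.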

Why It Matters: Completing the Mathematical Reasoning Loop

Current AI systems like OpenAI's GPT-4, DeepSeek's Math models, and Lean-focused projects (e.g., ProofNet) are heavily optimized for forward proof search. This creates an imbalance. As the authors note, mathematical reasoning is a dual-process activity: verification (proof) and falsification (counterexample). A mathematician or a student doesn't just try to prove a conjecture; they also test it with examples and try to break it.

This work moves AI closer to being a balanced assistant in mathematical discovery. It could:

  • Accelerate Research: Help mathematicians quickly test conjectures by automatically searching for counterexamples, saving time on doomed proof attempts.
  • Improve Education: Provide students with not just verification of their correct proofs but also constructive, formal feedback showing why their incorrect statements are wrong.
  • Enhance Code Verification: The ability to generate counterexamples is directly applicable to formal methods in software engineering, helping find bugs in program specifications by disproving incorrect invariants or pre/post-conditions.

gentic.news Analysis

This paper is a sharp, necessary correction to the trajectory of AI-for-math research. The field's obsession with benchmark scores on proof datasets (like MiniF2F, MATH) has implicitly treated "mathematical reasoning" as synonymous with "proof generation." This work correctly identifies that as a fundamental limitation. The symbolic mutation strategy is clever because it bootstraps high-quality, verifiably correct training data from an existing corpus of truths—a classic move in environments where negative examples are scarce but positive ones are abundant.

Figure 1: Framework of counterexample training. In the data synthesis stage, the symbolic mutation drops the hypothesis.

From an engineering perspective, the integration with Lean 4 is non-negotiable for rigor. It moves the task out of the realm of "the model said this is a counterexample"—which is just another claim—into the realm of formal verification. The trust is placed in the Lean kernel, not the LLM. The LLM becomes a heuristic search agent for the kernel, which is a more robust and scalable paradigm.

However, the major unanswered question is the generalization ceiling. The mutation strategy likely generates counterexamples that are structurally similar to the proofs in the training corpus. The true test is whether models trained this way can generate novel, surprising counterexamples to complex, unseen conjectures—the kind that lead to genuine mathematical insight. The next step will be to evaluate these models on hold-out sets of false statements from ongoing mathematical research, not just synthetically mutated ones. If they can succeed there, they transition from a useful tool to a potential collaborator.

Frequently Asked Questions

What is formal counterexample generation?

Formal counterexample generation is the task of, given a false mathematical statement, producing a specific object that violates the statement and a complete formal proof (in a system like Lean 4) that verifies the violation. It goes beyond informal explanation or example-giving to meet the standard of rigorous, machine-checkable mathematics.

How does the symbolic mutation strategy work?

The strategy creates training data by taking formally proven theorems, deliberately removing one or more of their necessary hypotheses to create a false statement. The context of the original proof often contains a natural counterexample to this new false statement. This method systematically generates a large set of (false statement, verified counterexample) pairs from a corpus of true theorems, solving the data scarcity problem.

Why is using Lean 4 important for this task?

Lean 4 is an interactive theorem prover with a small, trusted kernel that checks proofs for absolute correctness. By requiring models to output Lean 4 code, the work ensures that every successful model output is independently verifiable. The LLM's role is to search for a counterexample and proof; Lean's role is to certify it. This decoupling is crucial for building trustworthy systems.

How does this compare to AI models that only do theorem proving?

Models trained only for theorem proving are one-sided. They attempt to find a proof path for any given statement. A model trained for disproof adds a critical complementary skill: it can recognize when a statement is false and demonstrate why concretely. A complete mathematical reasoning AI would need to orchestrate both capabilities, attempting to prove a conjecture and, if that fails efficiently, searching for a counterexample instead.

AI Analysis

The research underscores a maturation in how we frame AI's role in formal domains. Instead of asking LLMs to "be correct," the paradigm shifts to using them as stochastic search algorithms within a verifiable formal system. This is the correct architectural pattern: the LLM proposes, the formal system disposes. The multi-reward expert iteration framework is particularly noteworthy. It likely uses sparse rewards (the final proof verification) alongside denser, shaping rewards (syntax validity, type-correctness of intermediate steps), which is essential for training in such a combinatorial output space.

Practitioners should note the methodology's potential applicability beyond mathematics to any domain with a formal specification language, such as hardware verification (SystemVerilog) or protocol correctness (TLA+).

The paper's most significant implication may be for dataset creation in niche formal tasks. The symbolic mutation strategy is a blueprint for generating infinite, high-quality, and *correct* training data from a finite seed of verified knowledge. This could be applied to train models for finding bugs in code (mutate correct programs), flaws in legal arguments (mutate sound legal rules), or errors in scientific hypotheses. The core idea—systematically breaking correct things to learn how to recognize broken things—is powerful and general.
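To make the "mutate correct programs" transfer concrete, here is our own minimal sketch (not from the paper) of hypothesis-dropping applied to code: deleting one guard at a time from a correct function yields buggy variants paired with a known-correct original, the program-analysis analogue of discarding a theorem hypothesis.

```python
import ast

class DropIf(ast.NodeTransformer):
    """Delete the k-th `if` statement, splicing in its body unguarded."""
    def __init__(self, k):
        self.k = k
        self.count = -1

    def visit_If(self, node):
        self.generic_visit(node)
        self.count += 1
        if self.count == self.k:
            return node.body  # drop the check, keep the guarded body
        return node

def guard_drop_mutants(source: str):
    """Yield one buggy variant per deleted guard -- the code analogue
    of the paper's hypothesis-dropping mutation (illustrative sketch)."""
    n_ifs = sum(isinstance(n, ast.If) for n in ast.walk(ast.parse(source)))
    for k in range(n_ifs):
        tree = DropIf(k).visit(ast.parse(source))
        ast.fix_missing_locations(tree)
        yield ast.unparse(tree)
```

Each (original, mutant) pair comes with a ground-truth label for free, mirroring how the paper bootstraps verified negative examples from a corpus of truths.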
Original source: arxiv.org
