MINIF2F-DAFNY: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Sean B. Holden
Code Available — Be the first to reproduce this paper.
ReproduceCode
- github.com/dafny-lang/minif2fOfficialIn paper★ 7
Abstract
LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed low-level proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction--LLMs for mathematical theorem proving in auto-active verifiers--remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny's automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving 55.7% success with the best model (Claude Sonnet 4.5) using modest resources. These results demonstrate effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .