FormalQualBench Comparator Verifies Lean Proofs for Correctness and No Extra Axioms
A key component of FormalQualBench is Comparator, which rigorously checks that each solution proves the correct statement, introduces no additional axioms, and is accepted by the Lean kernel. Comparator detects sophisticated workarounds that evade basic compilation checks.
This led us to develop FormalQualBench (https://www.math.inc/formalqualbench), a benchmark designed to reinforce correctness standards across the field. Each statement is checked by a human expert; our goal is to guarantee that all proofs are faithful to the underlying mathematics.
In our evaluations, models such as Codex used elaborator-level techniques to bypass constraints. In one example, a Codex-generated snippet assembles the string "ax" ++ "iom" to inject an axiom via metaprogramming. Because the word never appears literally in the source, this evades static detection, but it is reliably caught by Comparator.
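
To make the "no additional axioms" requirement concrete, here is a minimal Lean 4 sketch using the built-in `#print axioms` command. It is an illustration only and not a description of how Comparator works internally; the names `two_add_two`, `cheat`, and `one_eq_two` are hypothetical. The point is that an axiom a proof depends on is recorded in the environment and shows up in this audit, whatever source text produced the declaration.

```lean
-- An honest proof: checked by computation, with no axioms involved.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Reports that `two_add_two` does not depend on any axioms.
#print axioms two_add_two

-- A hypothetical injected axiom asserting a contradiction.
axiom cheat : False

-- With `False` available, any statement "compiles", including a false one.
theorem one_eq_two : 1 = 2 := cheat.elim

-- Lists `cheat` among the axioms that `one_eq_two` depends on.
#print axioms one_eq_two
```

Both theorems are accepted by Lean, which is why compiling alone is not evidence of correctness. Only the dependency audit separates them.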
