Been a while since we've had a paper on provers. This "Defense-in-Depth Verifier" is actually a clever trick. Most of the paper is dedicated to defeating reward hacks. An exemplary work on what actually goes into "RL environments".
With the MaxProof framework, M3 exceeded the human gold-medal threshold on both sets. In this paper, we go deeper into the technical path behind our progress in mathematical proof: improving the base model, aligning a verifier, building refinement capability, and designing the test-time scaling framework MaxProof.
Here it is: "MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Evolutionary Search" https://huggingface.co/papers/2606.13392
