
Next, we ask whether verifier-in-the-loop (ViL) training can improve the generator itself, especially after standard RLVR has saturated.
With test-time verification, we see a 33% pass@1 gain. That is expected, since the generator learns to use the verifier's output.
The surprise: even without a verifier at inference, standalone pass@1 improves by 30%. 4/5