Tech · 22h ago

Chen Wu introduces a self-verification training method that doubles math accuracy and delivers a 14-fold scientific reasoning gain

The approach uses both training-time and test-time compute.

Chen Wu (@ChenHenryWu)

Next, we ask whether verifier-in-the-loop (ViL) training can improve the generator itself, especially after standard RLVR (reinforcement learning with verifiable rewards) has saturated.

With test-time verification, we see a 33% pass@1 gain. This is expected, since the generator learns to use the verifier's output.

The surprise: even without a verifier at inference, standalone pass@1 improves by 30%. 4/5
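
The pass@1 numbers above are easier to read with the metric spelled out. Below is a minimal Python sketch of the standard unbiased pass@k estimator introduced with the Codex evaluation (Chen et al., 2021), which for k = 1 reduces to the fraction of correct samples, plus a relative-gain helper. Two assumptions: the thread does not say whether "33%" and "30%" are relative or absolute gains, so the helper treats them as relative; and the per-problem sample counts are made up for illustration, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(counts: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems given (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement: 0.30 reads as 'improves by 30%'."""
    return (new - baseline) / baseline


# Made-up (n_samples, n_correct) counts for four problems; not paper data.
baseline_counts = [(10, 2), (10, 5), (10, 0), (10, 3)]
improved_counts = [(10, 3), (10, 6), (10, 1), (10, 4)]

base = mean_pass_at_1(baseline_counts)
new = mean_pass_at_1(improved_counts)
print(f"pass@1 {base:.3f} -> {new:.3f} ({relative_gain(base, new):.0%} relative gain)")
```

Averaging per-problem estimates, rather than pooling all samples, keeps each problem equally weighted, which is the usual convention for benchmark pass@1.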

Chen Wu (@ChenHenryWu)

We want verification to tell us not just whether a solution is wrong but where and why, so that self-improvement has a direction. But how should we train for that?

Our key idea: show the model the reference solution, giving it more context to teach itself where the errors are and why. We call this self-trained verification (STV). We then put the verifier in the loop to drive improvement at both test time and training time. 2/5
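
One way to picture the STV idea is as a change in what the verifier sees. The sketch below is hypothetical: the thread does not give the prompt template, the names `VerifierInput` and `build_verifier_prompt` are invented for illustration, and it assumes the reference solution is available only while training the verifier, not at test time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerifierInput:
    problem: str
    candidate: str
    # Assumption: only available while training the verifier.
    reference: Optional[str] = None


def build_verifier_prompt(x: VerifierInput) -> str:
    """Hypothetical prompt; asks for where and why, not just a yes/no verdict."""
    parts = [f"Problem:\n{x.problem}", f"Candidate solution:\n{x.candidate}"]
    if x.reference is not None:
        # Extra context that makes it easier to locate and explain the error.
        parts.append(f"Reference solution:\n{x.reference}")
    parts.append(
        "Is the candidate correct? If not, name the first incorrect step "
        "and explain why it is wrong."
    )
    return "\n\n".join(parts)


problem = "Compute 2 + 3 * 4."
candidate = "Step 1: 2 + 3 = 5. Step 2: 5 * 4 = 20."
train_prompt = build_verifier_prompt(
    VerifierInput(problem, candidate, reference="3 * 4 = 12, then 2 + 12 = 14.")
)
test_prompt = build_verifier_prompt(VerifierInput(problem, candidate))
print(train_prompt)
```

In the training-time prompt, the verifier can compare Step 1 against the reference and see that the multiplication should come first. In the test-time prompt it has to find that same error unaided.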

Chen Wu (@ChenHenryWu)

At test time, the trained verifier makes refinement actually scale. STV roughly doubles pass@1 on hard math, and with enough verification compute, an STV-guided 8B model even beats a model 4× its size. 3/5
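
The test-time loop described above (generate, verify, refine, and spend more verification compute for better results) can be sketched as follows. This is not the released STV code: `refine_with_verifier`, the `max_rounds` budget, and the toy generator and verifier are hypothetical stand-ins for LLM calls, chosen so the loop runs deterministically.

```python
from typing import Callable, Optional, Tuple

# A verdict pairs "is it correct?" with feedback on where/why it is wrong.
Verdict = Tuple[bool, str]
Generator = Callable[[str, Optional[str], Optional[str]], str]
Verifier = Callable[[str, str], Verdict]


def refine_with_verifier(
    problem: str, generate: Generator, verify: Verifier, max_rounds: int
) -> Tuple[str, int]:
    """Verify-then-refine until the verifier accepts or the budget runs out.

    Returns (final_answer, verifier_calls). If the budget is exhausted, the
    last refinement is returned even though it was never verified.
    """
    answer = generate(problem, None, None)
    for calls in range(1, max_rounds + 1):
        ok, feedback = verify(problem, answer)
        if ok:
            return answer, calls
        # Feed the verifier's critique back to the generator.
        answer = generate(problem, answer, feedback)
    return answer, max_rounds


# Toy stand-ins for an LLM generator and a trained verifier (hypothetical).
def toy_generate(problem: str, prev: Optional[str], feedback: Optional[str]) -> str:
    if prev is None:
        return "40"  # deliberately wrong first attempt
    step = 1 if feedback == "too small" else -1
    return str(int(prev) + step)


def toy_verify(problem: str, answer: str) -> Verdict:
    value, target = int(answer), 42
    if value == target:
        return True, "correct"
    return False, "too small" if value < target else "too large"


print(refine_with_verifier("What is 6*7?", toy_generate, toy_verify, max_rounds=1))
print(refine_with_verifier("What is 6*7?", toy_generate, toy_verify, max_rounds=5))
```

With a budget of one verifier call the toy loop stops at 41. With five, it reaches 42 after three calls, a miniature version of how a larger verification budget lets refinement go further.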

Chen Wu (@ChenHenryWu)

We see learning to verify as a promising lever for training- and test-time self-improvement, perhaps as important as scaling data and generators.

Check our paper for results on more domains, weak-to-strong verification, and ablations.

📄 paper: http://arxiv.org/abs/2605.30290
💻 code: https://github.com/ar-forum/stv
🌐 site: http://ar-forum.github.io/stv-webpage

This work was done at CMU with @AdtRaghunathan. 5/5