[1/7] On-policy distillation (OPD) has a simple post-training loop: sample from the student, label with the teacher, repeat.
The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there.
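
A toy sketch of that loop, to make the cold start concrete. This is not the paper's method. It assumes OPD means on-policy distillation, shrinks both models to a single categorical distribution over a 4-token vocabulary, and uses a sampled reverse-KL gradient as the update. The names `opd_toy`, `reverse_kl`, and all hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher); teacher_probs must be strictly positive."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def opd_toy(teacher_probs, steps=300, batch=64, lr=0.3, seed=0):
    """Toy loop: sample from the student, label with the teacher, repeat."""
    rng = random.Random(seed)
    vocab = len(teacher_probs)
    logits = [0.0] * vocab  # untrained student: uniform over the vocabulary
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        history.append(reverse_kl(probs, teacher_probs))
        # 1) Sample from the current student (at step 0, the weakest one).
        samples = rng.choices(range(vocab), weights=probs, k=batch)
        grad = [0.0] * vocab
        for x in samples:
            # 2) Teacher labels the student's own sample with its log-prob.
            signal = math.log(probs[x]) - math.log(teacher_probs[x])
            # Score-function estimate of the reverse-KL gradient w.r.t. logits.
            for j in range(vocab):
                indicator = 1.0 if j == x else 0.0
                grad[j] += signal * (indicator - probs[j]) / batch
        # 3) Update the student, then repeat with fresh samples.
        logits = [z - lr * g for z, g in zip(logits, grad)]
    history.append(reverse_kl(softmax(logits), teacher_probs))
    return logits, history


if __name__ == "__main__":
    _, history = opd_toy([0.7, 0.2, 0.05, 0.05])
    print(f"KL at start: {history[0]:.3f}, KL at end: {history[-1]:.3f}")
```

Note that `history[0]` is measured on the untrained student, so the first batch of samples is drawn from the distribution furthest from the teacher.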
TRB Paper: http://arxiv.org/pdf/2605.31159