
Alexey Gorbatovski explains how the TRB paper's OPD method uses a KL-divergence warmup to manage weak initial student rollouts

Will Brown notes that SFT can be viewed as OPD in which the teacher, rather than the student, generates the samples.

Original post: Rishabh Agarwal #196

[1/7] OPD has a simple post-training loop: sample from the student, label with the teacher, repeat.

The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there.

TRB Paper: http://arxiv.org/pdf/2605.31159

5:56 AM · Jun 1, 2026 · 1.4K Views
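The loop described in the post (sample from the student, label with the teacher, repeat) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the TRB paper's method: the student and teacher are hypothetical tabular models over a three-token vocabulary and two-token sequences, "labeling" is taken to mean the teacher's next-token distribution at each prefix the student visits, the loss is assumed to be per-token reverse KL, and the KL warmup mentioned in the headline is not modeled. All names (`run_opd`, `TEACHER_FIRST`, and so on) are invented for the example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(p, q):
    """KL(p || q) for two categorical distributions with positive entries."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def kl_step(logits, q, lr):
    """One in-place gradient step on KL(softmax(logits) || q).

    For softmax logits z, d KL / d z_k = p_k * (log p_k - log q_k - KL).
    """
    p = softmax(logits)
    kl = reverse_kl(p, q)
    for k in range(len(logits)):
        logits[k] -= lr * p[k] * (math.log(p[k]) - math.log(q[k]) - kl)


# Hypothetical teacher: first-token distribution, then one
# second-token distribution per possible first token.
TEACHER_FIRST = [0.7, 0.2, 0.1]
TEACHER_SECOND = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]


def sequence_kl(s_first, s_second):
    """Exact sequence-level KL(student || teacher), via the chain rule."""
    p0 = softmax(s_first)
    total = reverse_kl(p0, TEACHER_FIRST)
    for y0, weight in enumerate(p0):
        total += weight * reverse_kl(softmax(s_second[y0]), TEACHER_SECOND[y0])
    return total


def run_opd(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # The student starts uniform, i.e. at its weakest.
    s_first = [0.0] * 3
    s_second = [[0.0] * 3 for _ in range(3)]
    before = sequence_kl(s_first, s_second)
    for _ in range(steps):
        # 1. Sample from the student. Only the first token is drawn,
        #    because nothing in a two-token sequence conditions on the second.
        y0 = rng.choices(range(3), weights=softmax(s_first))[0]
        # 2. Label with the teacher: its distributions at the visited prefixes.
        q_first, q_second = TEACHER_FIRST, TEACHER_SECOND[y0]
        # 3. Update the student only where it actually went, then repeat.
        kl_step(s_first, q_first, lr)
        kl_step(s_second[y0], q_second, lr)
    return before, sequence_kl(s_first, s_second)
```

Calling `run_opd()` returns the sequence-level KL before and after training. Because only the second-token row selected by the student's own sample is updated, early steps spend their updates on prefixes the untrained student happens to produce, which is the "awkward start" the post refers to.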
Posts from X
Views 2.2K · Bookmarks 6 · Likes 27 · Retweets 1 · Replies 11
will brown (@willccbb)

sft is just opd where the actor is the teacher instead of the student

1h · Views 2.2K · Likes 27 · Bookmarks 6
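One way to make this remark concrete is to write both objectives side by side. This is a sketch under assumptions the post does not state: OPD is taken to minimize a sequence-level reverse KL on student samples, SFT is taken to be maximum likelihood on teacher samples, and the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $x$ (prompt), and $y$ (response) are introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{OPD}}(\theta)
  &= \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \bigl[\log \pi_\theta(y \mid x) - \log \pi_T(y \mid x)\bigr]
   = \mathbb{E}_{x}\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_T\bigr), \\
\mathcal{L}_{\mathrm{SFT}}(\theta)
  &= \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_T(\cdot \mid x)}
     \bigl[-\log \pi_\theta(y \mid x)\bigr]
   = \mathbb{E}_{x}\bigl[\mathrm{KL}\bigl(\pi_T \,\|\, \pi_\theta\bigr) + H(\pi_T)\bigr].
\end{align*}
```

In both cases the teacher supplies the training signal; what changes is whose samples are scored. Under these assumptions the divergence also flips from reverse to forward KL, so the correspondence is a loose one rather than an exact identity.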