/AI · 11h ago

Alexey Gorbatovski explains how the TRB paper's OPD method uses a KL-divergence warmup to manage weak initial student rollouts

Will Brown notes that SFT can be viewed as OPD in which the teacher, rather than the student, generates the rollouts.

Original post

Alexey Gorbatovski (@AMyashka)

[1/7] OPD has a simple post-training loop: sample from the student, label with the teacher, repeat.

The awkward part is the start. The first rollouts come from the weakest version of the student, and training begins there.

TRB Paper: http://arxiv.org/pdf/2605.31159

5:56 AM · Jun 1, 2026 · 1.4K Views
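The loop described in the post (sample from the student, label with the teacher, repeat) can be sketched on a toy problem. This is a minimal illustration, not the TRB paper's method: the bigram "models", the names `sample_rollout`, `teacher_label`, `opd_step`, and `train`, the learning rate, and the choice of cross-entropy against the teacher's next-token distribution as the training signal are all assumptions made for the example. It does not implement the warmup mentioned in the summary; it only shows that the first rollouts come from an untrained (uniform) student.

```python
import math
import random

VOCAB = 3      # toy vocabulary size (hypothetical)
SEQ_LEN = 4    # tokens per rollout (hypothetical)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_rollout(student_logits, rng):
    """Sample from the student; the state is simply the previous token."""
    prev, states = 0, []
    for _ in range(SEQ_LEN):
        states.append(prev)
        probs = softmax(student_logits[prev])
        prev = rng.choices(range(VOCAB), weights=probs)[0]
    return states

def teacher_label(teacher_probs, states):
    """Teacher supplies its next-token distribution at each state the student visited."""
    return [teacher_probs[s] for s in states]

def opd_step(student_logits, teacher_probs, rng, lr=0.5):
    states = sample_rollout(student_logits, rng)       # 1. sample from the student
    targets = teacher_label(teacher_probs, states)     # 2. label with the teacher
    for s, target in zip(states, targets):
        probs = softmax(student_logits[s])
        # Gradient of cross-entropy(target, softmax(logits)) w.r.t. logits is probs - target.
        for k in range(VOCAB):
            student_logits[s][k] -= lr * (probs[k] - target[k])

def mean_kl(student_logits, teacher_probs):
    """Average KL(teacher || student) over all states, used only for monitoring."""
    total = 0.0
    for s in range(VOCAB):
        q = softmax(student_logits[s])
        total += sum(p * math.log(p / qk) for p, qk in zip(teacher_probs[s], q))
    return total / VOCAB

def train(steps=500, seed=0):
    rng = random.Random(seed)
    teacher_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
    student_logits = [[0.0] * VOCAB for _ in range(VOCAB)]  # weakest student: uniform
    before = mean_kl(student_logits, teacher_probs)
    for _ in range(steps):                              # 3. repeat
        opd_step(student_logits, teacher_probs, rng)
    return before, mean_kl(student_logits, teacher_probs)
```

Calling `train()` returns the student's average divergence from the teacher before and after training. Because updates happen only at states the student itself visits, the early data distribution is determined by the untrained student, which is the "awkward start" the post refers to.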

Sentiment

The one comment with sentiment is positive about the OPD method, saying it shows how stronger reasoning improves the everyday usefulness of models like Qwen.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views: 2.2K · Bookmarks: 6 · Likes: 27 · Retweets: 1 · Replies: 11

will brown (@willccbb)

sft is just opd where the actor is the teacher instead of the student

1h ago · 2.2K views · 27 likes · 6 bookmarks
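One way to read this remark, as a sketch under assumptions not stated in either post: write a single loss in which sequences are drawn from an actor distribution $\mu$ and the student $\pi_\theta$ is compared with the teacher $\pi_T$ at every position. The symbols $\mu$, $D$, $\pi_T$, and $\pi_\theta$ are introduced here for illustration only.

```latex
\mathcal{L}_{\mu}(\theta)
  = \mathbb{E}_{x}\, \mathbb{E}_{y \sim \mu(\cdot \mid x)}
    \left[ \sum_{t} D\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right) \right]
```

Setting $\mu = \pi_\theta$ gives an on-policy loss of the OPD kind, where the student's own rollouts decide which positions are scored. Setting $\mu = \pi_T$ scores positions along teacher-generated sequences instead; if $D$ is taken to be cross-entropy, then $\mathbb{E}_{y_t \sim \pi_T}[-\log \pi_\theta(y_t \mid x, y_{<t})] = H(\pi_T, \pi_\theta)$, so in expectation this is the usual negative log-likelihood of SFT on teacher samples. The particular divergence used in practice may differ, so the correspondence should be read as conceptual.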