
AI Builder Defends Off-Policy RL With Edited Base Model Data For Alignment


Original post

it’s fine + valid off-policy RL if you’re using a labeled / filtered / surgically edited (with caveats) set of completions from the base model, esp for alignment stuff where you’re not trying to explore anyway

but if the source is something else, it’s like what are you doing lol

will brown@willccbb

victor wembanyama studying magnus carlsen endgame losses so he can avoid making the same mistakes

12:40 AM · Jun 13, 2026 · 1.5K Views


Posts from X


will brown@willccbb

but if you’re doing that anyway from a base model, why not just do RL or self-distill? bigger batch + fewer steps if you’re worried about hacks?

caveats on editing = v similar to caveats on self-distillation context. be careful + don’t expect magic if you’re pushing it that far
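To make the setup in the original post concrete, here is a minimal sketch of off-policy training on filtered base-model completions. It is an illustration under stated assumptions, not the poster's method: the sample fields (`logp_base`, `logp_policy`, `reward`, `label`), the helper names, and the PPO-style clipped importance ratio are all hypothetical choices made for this example.

```python
import math


def filter_completions(samples, keep):
    """Keep only the completions that pass the label/filter predicate."""
    return [s for s in samples if keep(s)]


def clipped_off_policy_loss(samples, clip=0.2):
    """Clipped importance-weighted surrogate loss (PPO-style; an assumed choice).

    Each sample is a dict with:
      logp_base   -- log-prob of the completion under the base (behavior) model
      logp_policy -- log-prob of the same completion under the current policy
      reward      -- scalar reward or label-derived score
    """
    if not samples:
        raise ValueError("no samples left after filtering")
    total = 0.0
    for s in samples:
        # Importance ratio: current policy vs. the model that produced the data.
        ratio = math.exp(s["logp_policy"] - s["logp_base"])
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Pessimistic (min) objective, as in the standard clipped surrogate.
        total += min(ratio * s["reward"], clipped * s["reward"])
    return -total / len(samples)


# Hypothetical toy data: completions sampled from the base model, then labeled.
samples = [
    {"logp_base": -1.0, "logp_policy": -1.0, "reward": 1.0, "label": "good"},
    {"logp_base": -2.0, "logp_policy": -1.0, "reward": 1.0, "label": "good"},
    {"logp_base": -1.5, "logp_policy": -1.5, "reward": 1.0, "label": "bad"},
]
kept = filter_completions(samples, lambda s: s["label"] == "good")
loss = clipped_off_policy_loss(kept)
```

When the policy being trained starts out as the base model itself, every ratio begins at 1, so the filtered data is close to on-policy at the first step. If the completions come from some other source, the ratios can start far from 1 and the clipping discards most of the signal, which is one way to read the "what are you doing" objection.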
