it’s fine + valid off-policy RL if you’re using a labeled / filtered / surgically edited (with caveats) set of completions from the base model, esp for alignment stuff where you’re not trying to explore anyway
but if the source is something else, it’s like what are you doing lol
victor wembanyama studying magnus carlsen endgame losses so he can avoid making the same mistakes