LLM RL is when you say advantage instead of weighting heuristic, policy instead of model and rollout instead of sample etc, for no good reason whatsoever.
Yoav Goldberg, AI2-Israel research director, argues reinforcement learning jargon in LLM discussions is unnecessary rebranding
He notes 'policy' and 'rollout' replace 'model' and 'sample'.
Most Activity
@yoavgo Here’s the case: if you accept the axiom that rule 1 is to be “as on-policy as possible” the design space changes. You now have tons of policies, and this new “group” thing that matters the most. Tokens and weights get abstracted away. Terminology wasn’t scaling.
LLM RL is when you say advantage instead of weighting heuristic, policy instead of model and rollout instead of sample etc, for no good reason whatsoever.
@srush_nlp why can't i say "learn with the freshest gradients", "be as close as possible to the representative model" etc?
@yoavgo Here’s the case: if you accept the axiom that rule 1 is to be “as on-policy as possible” the design space changes. You now have tons of policies, and this new “group” thing that matters the most. Tokens and weights get abstracted away. Terminology wasn’t scaling.