2h ago

Sasha Rush and Yoav Goldberg argue prompt conditioning decouples machine learning models from policies

2 top authors

Goldberg suggests reframing methods like GRPO in ML terms

Original post

@yoavgo So one interesting example where model != policy is when you have special conditioning. People are playing with this a lot in self distillation, where the same weights are treated as different policies based on the prompt.

5:38 AM · Jun 4, 2026

2 more posts

Sasha Rush@srush_nlp·2hReply
@yoavgo The case where “freshest gradients” breaks down is in async RL where you actually need to think of the weights as a temporal sequence and name them.
View on
(((ل()(ل() 'yoav))))👾@yoavgo·2hReply
i think treating different prompts as inducing different policies is a great example actually, because this is to my understanding different from how (modern-ish) RL people will consider a policy, which is much closer to "a set of weights that determines actions". and then talking about "reference policy" and taking gradients wrt to that makes the sense of "policy" that you talk about *less* visible. so for example if we were to describe GRPO in "regular ML" terms, and then in departures from it we would say "and now instead of using the *model* we will use a behavior, which is a combination of model and prompt", i think this will actually help speed things up in terms of progress. (not sure if this came out very clear, i am on the go between things. happy to elaborate if needed)