(((ل()(ل() 'yoav))))👾 @yoavgo
i think treating different prompts as inducing different policies is a great example actually, because to my understanding this is different from how (modern-ish) RL people think of a policy, which is much closer to "a set of weights that determines actions". and then talking about a "reference policy" and taking gradients wrt that makes the sense of "policy" that you talk about *less* visible.
so for example, if we were to describe GRPO in "regular ML" terms, and then, as a departure from that, say "and now instead of using the *model* we will use a behavior, which is a combination of model and prompt", i think this would actually help speed up progress.
(not sure if this came out very clear, i am on the go between things. happy to elaborate if needed)
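a toy sketch of the two senses of "policy" above, in case it helps. everything here is hypothetical and made up for illustration (the `Model` and `Behavior` classes, the bag-of-words scoring, the weights); it is not a real language model and not GRPO. it only shows that one set of weights gives one policy in the "weights" sense, while each (model, prompt) pair gives its own action distribution in the "behavior" sense.

```python
import math
from dataclasses import dataclass

ACTIONS = ["a", "b", "c"]  # hypothetical action space


@dataclass
class Model:
    """Toy stand-in for "a set of weights that determines actions"."""
    weights: dict  # maps (prompt_token, action) -> score; made-up parameterization

    def action_probs(self, prompt: str) -> dict:
        # Score each action by summing the weights of the prompt's tokens,
        # then apply a numerically stable softmax.
        tokens = prompt.split()
        scores = [sum(self.weights.get((tok, a), 0.0) for tok in tokens)
                  for a in ACTIONS]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return {a: e / total for a, e in zip(ACTIONS, exps)}


@dataclass
class Behavior:
    """The other sense of "policy": a model combined with a prompt."""
    model: Model
    prompt: str

    def action_probs(self) -> dict:
        return self.model.action_probs(self.prompt)


# One set of weights: a single policy in the "weights" sense.
model = Model(weights={("polite", "a"): 2.0, ("terse", "b"): 2.0})

# Two prompts over the same weights: two policies in the "behavior" sense.
polite = Behavior(model, "be polite")
terse = Behavior(model, "be terse")

for name, behavior in [("polite", polite), ("terse", terse)]:
    probs = behavior.action_probs()
    print(name, {a: round(p, 3) for a, p in probs.items()})
```

the two behaviors share the exact same `model` object, so anything phrased only in terms of the weights cannot tell them apart, even though their action distributions differ.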