/AI10h ago

Yoav Goldberg, AI2-Israel research director, argues reinforcement learning jargon in LLM discussions is unnecessary rebranding

He notes 'policy' and 'rollout' replace 'model' and 'sample'.

--0--
Original post

LLM RL is when you say advantage instead of weighting heuristic, policy instead of model and rollout instead of sample etc, for no good reason whatsoever.

10:05 PM · Jun 3, 2026 · 5.3K Views
Sentiment
Sentiment building, check back later.
Cluster Engagement
-
Views
-
Comments
-
Reposts
-
Bookmarks
Expand data
Posts from X
Most Activity
Most ActivityTimeline
VIEWS1.1KBOOKMARKS2LIKES15
Sasha Rush@srush_nlp

@yoavgo Here’s the case: if you accept the axiom that rule 1 is to be “as on-policy as possible” the design space changes. You now have tons of policies, and this new “group” thing that matters the most. Tokens and weights get abstracted away. Terminology wasn’t scaling.

LLM RL is when you say advantage instead of weighting heuristic, policy instead of model and rollout instead of sample etc, for no good reason whatsoever.

3hViews 1.1KLikes 15Bookmarks 2
REPLIES2

@srush_nlp why can't i say "learn with the freshest gradients", "be as close as possible to the representative model" etc?

Sasha Rush@srush_nlp

@yoavgo Here’s the case: if you accept the axiom that rule 1 is to be “as on-policy as possible” the design space changes. You now have tons of policies, and this new “group” thing that matters the most. Tokens and weights get abstracted away. Terminology wasn’t scaling.

2hViews 331Likes 2Bookmarks 1