Researcher Suggests Distributional RL as PPO Alternative for Language Models

Original post

@teortaxesTex wrt serious PPO-replacement-contenders. why not start off with ideas that are like. not regressing a possibly multimodal return estimate via MSE unimodally? distributional RL, but make it LM. something like that. gotta toy w/ ideas like these at some point

kalomaze@kalomaze

@teortaxesTex >forces the assumption of a frozen reference model uuuugghhhhhhhhhhhhhhhh

1:19 AM · Jun 21, 2026 · 240 Views
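The complaint about regressing a multimodal return "via MSE unimodally" can be illustrated with a toy example. The sketch below is not the posters' method; it assumes a made-up bimodal set of returns (half at -1, half at +1) and hypothetical helper names (`mse_optimal_value`, `categorical_fit`). It compares the single scalar that minimizes squared error with a categorical distribution over fixed atoms, in the spirit of distributional RL. The nearest-atom binning is a simplification chosen for brevity; published distributional methods typically use a more careful projection.

```python
import numpy as np

def mse_optimal_value(returns):
    """Scalar that minimizes mean squared error over the samples: the sample mean."""
    return float(np.mean(returns))

def categorical_fit(returns, atoms):
    """Categorical distribution over fixed atoms.

    Each return is assigned to its nearest atom (a simplifying assumption),
    and the probabilities are the resulting empirical frequencies, which is
    the maximum-likelihood categorical for those assignments.
    """
    returns = np.asarray(returns, dtype=float)
    atoms = np.asarray(atoms, dtype=float)
    nearest = np.abs(returns[:, None] - atoms[None, :]).argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(atoms))
    return counts / counts.sum()

# Hypothetical bimodal returns: half the rollouts fail (-1), half succeed (+1).
returns = np.array([-1.0] * 50 + [1.0] * 50)
atoms = np.linspace(-1.0, 1.0, 5)  # support points: -1, -0.5, 0, 0.5, 1

value = mse_optimal_value(returns)      # collapses both modes into one number
probs = categorical_fit(returns, atoms)  # keeps mass on each mode separately

print("MSE-optimal scalar:", value)
print("Categorical probs :", dict(zip(atoms.tolist(), probs.tolist())))
```

In this toy case the scalar estimate lands at 0, a return that never actually occurs, while the categorical fit places half its mass on each observed outcome. Whether this carries over usefully to language-model training is exactly the open question the posts raise.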

Sentiment

Users criticized Gaussian assumptions in advantage estimation for language models as invalid for combinatorial spaces, supporting distributional RL alternatives to PPO.

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.

Posts from X

Views: 100 · Likes: 1

kalomaze@kalomaze

@teortaxesTex we should stop doing advantage estimation under gaussian assumptions that simply do not apply to "literally all of combinatorial language", this is the product of a path dependency that isn't killing us but is probably limiting enough to be leaving expressivity on the table
