
Prime Intellect's kalomaze warns that VIMPO, a critic-free RL alternative to GRPO, requires a frozen reference model

The method estimates dense token-level value directly from log-probability ratios.
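As a rough illustration of what a token-level signal built from log-probability ratios could look like, the sketch below scales the per-token log-ratio between a policy and a frozen reference model. This is a DPO-style implicit-reward form chosen purely for illustration; this summary does not spell out VIMPO's actual estimator, and the function name `token_log_ratio_signal`, the coefficient `beta`, and the sample numbers are all hypothetical.

```python
from typing import List


def token_log_ratio_signal(
    policy_logps: List[float],
    ref_logps: List[float],
    beta: float = 0.1,  # hypothetical scaling coefficient, not taken from the paper
) -> List[float]:
    """Return beta * (log pi(token) - log pi_ref(token)) for each token.

    Illustrative only: an assumed DPO-style form, not VIMPO's published estimator.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


# Made-up per-token log-probabilities for a three-token completion.
policy_logps = [-1.0, -0.5, -2.0]
ref_logps = [-1.5, -0.5, -1.0]  # must come from a model whose weights stay fixed

signal = token_log_ratio_signal(policy_logps, ref_logps, beta=1.0)
print(signal)
```

In any scheme of this shape, the reference log-probabilities only make sense as a baseline if the reference weights are never updated, which is the constraint kalomaze objects to in the replies.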


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Maybe a better alternative to GRPO than simply falling back on PPO

Xuandong Zhao @xuandongzhao

Happy to introduce our latest work — VIMPO: Value-Implicit Policy Optimization for LLMs

Most RL methods for LLM training face a trade-off:
· PPO-style methods use a value model (critic) for token-level credit, but critics are hard to train.
· GRPO-style methods drop the critic, but give every token the same trajectory-level signal.

Can we get the best of both worlds? 🧵

10:25 PM · Jun 20, 2026 · 3.8K Views
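The GRPO side of that trade-off can be made concrete with a small sketch: each completion's reward is normalized against its sampling group, and the resulting scalar is copied to every token of that completion. This follows the commonly described group-normalization recipe in simplified form. The function name, the use of population standard deviation, and the `eps` term are assumptions here, and real implementations differ in such details.

```python
from statistics import mean, pstdev
from typing import List


def grpo_token_advantages(
    rewards: List[float],
    lengths: List[int],
    eps: float = 1e-6,  # assumed small constant to avoid division by zero
) -> List[List[float]]:
    """Broadcast one group-normalized advantage to every token of each completion."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in completion i receives the same scalar advantages[i].
    return [[a] * n for a, n in zip(advantages, lengths)]


# Made-up group of four sampled completions with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 2, 4, 1]  # token counts per completion

per_token = grpo_token_advantages(rewards, lengths)
for row in per_token:
    print(row)
```

Because each row holds a single repeated value, nothing in this signal distinguishes a token that mattered from one that did not. A critic, or a critic-free token-level estimate, is meant to fill that gap.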

Sentiment

One commenter expressed frustration that VIMPO forces the assumption of a frozen reference model during LLM RL training.

Pos: 0.0%
Neg: 100.0%

1 comment with sentiment.


Posts from X


Views: 214 · Likes: 4 · Replies: 1

kalomaze @kalomaze

@teortaxesTex >forces the assumption of a frozen reference model uuuugghhhhhhhhhhhhhhhh


kalomaze @kalomaze

@teortaxesTex wrt serious PPO-replacement-contenders. why not start off with ideas that are like. not regressing a possibly multimodal return estimate via MSE unimodally? distributional RL, but make it LM. something like that. gotta toy w/ ideas like these at some point
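A toy sketch of the concern raised in this reply: when returns are bimodal, the constant prediction that minimizes squared error is the mean, which may be a value that never actually occurs, while a categorical estimate over fixed atoms keeps both modes. The histogram here is only in the spirit of categorical distributional RL. The helper `categorical_estimate`, the atom grid, and the sample returns are hypothetical and are not a proposal from the thread.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

# Made-up bimodal returns: half the rollouts fail (0.0), half succeed (1.0).
returns = [0.0, 0.0, 1.0, 1.0]

# The best constant prediction under squared error is the sample mean.
mse_optimal = mean(returns)


def categorical_estimate(samples: List[float], atoms: List[float]) -> Dict[float, float]:
    """Assign each sample to its nearest atom and return empirical probabilities."""
    counts = Counter(min(atoms, key=lambda a: abs(a - s)) for s in samples)
    return {a: counts[a] / len(samples) for a in atoms}


dist = categorical_estimate(returns, atoms=[0.0, 0.5, 1.0])

print(mse_optimal)  # a point estimate that matches no observed return
print(dist)         # probability mass stays on the two observed modes
```

The point estimate lands between the modes, whereas the categorical form puts no mass there.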


kalomaze @kalomaze

@teortaxesTex we should stop doing advantage estimation under gaussian assumptions that simply do not apply to "literally all of combinatorial language", this is the product of a path dependency that isn't killing us but is probably limiting enough to be leaving expressivity on the table
