Maybe a better alternative to GRPO than simply falling back on PPO
Happy to introduce our latest work — VIMPO: Value-Implicit Policy Optimization for LLMs
Most RL methods for LLM training face a trade-off: · PPO-style methods use a value model (critic) for token-level credit, but critics are hard to train. · GRPO-style methods drop the critic, but give every token the same trajectory-level signal.
Can we get the best of both worlds? 🧵
