MIT CSAIL PhD student Ryan Bahlous-Boldi introduces Vector Policy Optimization (VPO), a reinforcement learning method that maximizes vector-valued rewards to preserve distinct objectives in LLM post-training.
VPO raises pass@k scores on LiveCodeBench over GRPO baselines.
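For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled solutions is correct. The thread does not say how the scores were computed, so the sketch below is an assumption rather than the authors' evaluation code. It shows the widely used unbiased estimator, which draws n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the common unbiased estimator, not necessarily
    the exact procedure used for the numbers reported in the thread.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A policy whose samples are all near-identical tends to have pass@k close to pass@1, which is why sample diversity matters for this metric.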
@RyanBoldi update. i'm glad that a whopping 2000 of y'all listened to this, and you can now read the release at:
Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
RL has almost always meant trying to maximize a scalar reward.
Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional!
Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!
Check out their cool new work!
Great work on Vector Policy Optimization (VPO).
The standard scalar-reward view of post-training is inherently lossy: compressing a trajectory into one number discards a lot of useful structure, such as which sub-goals were met, where the reasoning failed, and what tradeoffs were made.
VPO makes this point concretely by using vector-valued rewards to train models that cover different regions of the reward space. This is especially relevant as inference-time search becomes more important, since search benefits directly from diverse candidates.
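A toy illustration of the lossy-scalar point, not VPO's training procedure: the reward vectors, the three objective names, and all function names below are hypothetical. Two candidate sets have the same average scalar reward under uniform weights, yet the set that spreads across the reward space gives a better best-of-k result whenever a test-time search cares about one objective in particular.

```python
from typing import Sequence

Reward = Sequence[float]  # hypothetical axes: (correctness, efficiency, readability)


def scalarize(reward: Reward, weights: Reward) -> float:
    """Collapse a reward vector into one number with fixed weights."""
    return sum(r * w for r, w in zip(reward, weights))


def mean_scalar(candidates: Sequence[Reward], weights: Reward) -> float:
    """Average scalarized reward, i.e. what a scalar objective would see."""
    return sum(scalarize(c, weights) for c in candidates) / len(candidates)


def best_of_k(candidates: Sequence[Reward], weights: Reward) -> float:
    """Score of the best candidate under the given test-time weights."""
    return max(scalarize(c, weights) for c in candidates)


# Made-up numbers chosen so both sets look identical after scalarization.
SIMILAR = [(0.6, 0.6, 0.6), (0.6, 0.6, 0.6), (0.6, 0.6, 0.6)]
DIVERSE = [(0.9, 0.5, 0.4), (0.4, 0.9, 0.5), (0.5, 0.4, 0.9)]

UNIFORM = (1 / 3, 1 / 3, 1 / 3)
TEST_TIME_WEIGHTS = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

if __name__ == "__main__":
    print("mean scalar:", mean_scalar(SIMILAR, UNIFORM), mean_scalar(DIVERSE, UNIFORM))
    for w in TEST_TIME_WEIGHTS:
        print(w, "best-of-k:", best_of_k(SIMILAR, w), "vs", best_of_k(DIVERSE, w))
```

The scalar average cannot tell the two sets apart, while the per-objective view shows that only the diverse set contains a strong candidate for each goal.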
This also connects to the success of methods like GEPA: richer feedback representations, whether vector-valued or semantic/natural-language, can carry much more optimization signal than sparse scalar rewards. More broadly, we may need to rethink “reward” as a structured feedback object rather than a single compressed scalar.