MIT CSAIL PhD student Ryan Bahlous-Boldi introduces Vector Policy Optimization, a reinforcement learning method that maximizes vector-valued rewards to preserve distinct objectives in LLM post-training

VPO raises pass@k scores on LiveCodeBench over GRPO baselines.

Original post

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

8:33 AM · May 22, 2026
Reposted by

@RyanBoldi update. i'm glad that a whopping 2000 of y'all listened to this, and you can now read the release at:

3:33 PM · May 22, 2026 · 59.7K Views
4:06 PM · May 22, 2026 · 644 Views

RL has almost always meant trying to maximize a scalar reward.

Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional!

Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!

4:03 PM · May 22, 2026 · 13.8K Views

Check out their new cool work!

5:44 PM · May 22, 2026 · 1.2K Views

Great work on Vector Policy Optimization (VPO).

The standard scalar-reward view of post-training is inherently lossy: compressing a trajectory into one number discards a lot of useful structure, such as which sub-goals were met, where the reasoning failed, and what tradeoffs were made.

VPO makes this point concretely by using vector-valued rewards to train models that cover different regions of the reward space. This is especially relevant as inference-time search becomes more important where diversity is not just beneficial but directly useful.

This also connects to the success of methods like GEPA: richer feedback representations, whether vector-valued or semantic/natural-language, can carry much more optimization signal than sparse scalar rewards. More broadly, we may need to rethink “reward” as a structured feedback object rather than a single compressed scalar.
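The point about scalar rewards being lossy can be illustrated with a toy example. This is a minimal sketch, not VPO itself: the reward components (tests passed, efficiency, readability), the two trajectories, and the helper names `scalarize` and `best_under` are all hypothetical, chosen only to show how two different reward vectors can collapse to the same scalar.

```python
from typing import Dict, Tuple

# Hypothetical reward components per trajectory:
# (tests_passed, efficiency, readability). Values are made up for illustration.
Reward = Tuple[float, float, float]

trajectories: Dict[str, Reward] = {
    "a": (1.0, 0.0, 0.5),  # correct but slow
    "b": (0.5, 1.0, 0.0),  # fast but only partly correct
}


def scalarize(reward: Reward, weights: Reward) -> float:
    """Collapse a reward vector into one number with fixed weights."""
    return sum(w * r for w, r in zip(weights, reward))


def best_under(weights: Reward) -> str:
    """Return the trajectory name that scores highest under the given weights."""
    return max(trajectories, key=lambda name: scalarize(trajectories[name], weights))


# With uniform weights chosen upfront, both trajectories look identical.
uniform: Reward = (1.0, 1.0, 1.0)
scalars = {name: scalarize(r, uniform) for name, r in trajectories.items()}

# Once the goal is known at test time, the vectors separate them again.
prefers_correctness = best_under((1.0, 0.0, 0.0))
prefers_efficiency = best_under((0.0, 1.0, 0.0))
```

Under the uniform weighting the two trajectories are indistinguishable, while either single-objective weighting picks a different winner. A search procedure that only ever sees the collapsed scalar has no way to tell them apart.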

6:22 PM · May 22, 2026 · 1.5K Views