MIT CSAIL PhD student Ryan Bahlous-Boldi introduces Vector Policy Optimization, a reinforcement learning method that maximizes vector-valued rewards to preserve distinct objectives in LLM post-training

Views: 55K · Bookmarks: 468 · Likes: 594 · Replies: 11

It's never made sense to me that RL collapses all reward signals to a single scalar. Today, we fix that!

Introducing Vector Policy Optimization: we train models to inherently optimize for the varied nature of a reward vector, creating diverse sets of answers ideal for test time search. Website and code coming soon!

Ryan Bahlous-Boldi@RyanBoldi

Your RL post-training may be sabotaging your LLM’s test-time scaling!

Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

Omar Khattab@lateinteraction

RL has almost always meant trying to maximize a scalar reward.

Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional!

Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!

Pulkit Agrawal@pulkitology

If the goal is test-time search, post-training should optimize for diversity and not just rewards. Introducing Vector Policy Optimization (VPO).

Soheil Feizi@FeiziSoheil

Great work on Vector Policy Optimization (VPO).

The standard scalar-reward view of post-training is inherently lossy: compressing a trajectory into one number discards a lot of useful structure, such as which sub-goals were met, where the reasoning failed, and what tradeoffs were made.

VPO makes this point concretely by using vector-valued rewards to train models that cover different regions of the reward space. This is especially relevant as inference-time search becomes more important where diversity is not just beneficial but directly useful.

This also connects to the success of methods like GEPA: richer feedback representations, whether vector-valued or semantic/natural-language, can carry much more optimization signal than sparse scalar rewards. More broadly, we may need to rethink “reward” as a structured feedback object rather than a single compressed scalar.
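To make the "lossy compression" point concrete, here is a minimal sketch. The three objectives, the reward values, and the `scalarize` helper are all hypothetical and chosen only for illustration; none of them come from the VPO paper. Two rollouts with different strengths receive the same scalar under one fixed weighting, yet rank differently once the weighting changes.

```python
import math

# Hypothetical per-objective rewards: (correctness, efficiency, readability).
rollout_a = (1.0, 0.0, 0.5)  # correct but slow
rollout_b = (0.5, 0.5, 0.5)  # balanced


def scalarize(reward_vector, weights):
    """Collapse a reward vector into one number with a fixed weighting."""
    return sum(r * w for r, w in zip(reward_vector, weights))


# Under a uniform weighting the two rollouts are indistinguishable.
uniform = (1 / 3, 1 / 3, 1 / 3)
same_scalar = math.isclose(scalarize(rollout_a, uniform),
                           scalarize(rollout_b, uniform))

# Under other weightings the ranking flips, which the scalar alone cannot show.
correctness_only = (1.0, 0.0, 0.0)
efficiency_only = (0.0, 1.0, 0.0)
a_wins_on_correctness = (scalarize(rollout_a, correctness_only)
                         > scalarize(rollout_b, correctness_only))
b_wins_on_efficiency = (scalarize(rollout_b, efficiency_only)
                        > scalarize(rollout_a, efficiency_only))

print(same_scalar, a_wins_on_correctness, b_wins_on_efficiency)
```

Once the weighting is fixed upfront, the information about which rollout suits which downstream preference is gone.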

Ryan Bahlous-Boldi@RyanBoldi

Joint work with @ishapuri101 @IdanShenfeld @akarshkumar0101 @MehulDamani2 @risi1979 @lateinteraction @ZhangWeiHong9 @pulkitology

paper: https://arxiv.org/abs/2605.22817

MIT NLP@nlp_mit

@RyanBoldi @ishapuri101 and team show how optimizing for vector-valued rewards is the best way to train models that will excel during test-time search!

Han Guo@HanGuo97

Check out their cool new work!

Ryan Bahlous-Boldi@RyanBoldi

With VPO, we post-train LLMs to learn to anticipate many potential downstream preferences/rewards at once, and to generate them all as a diverse set.

The algorithm simply asks you not to collapse the naturally vector-valued rewards that already exist in your tasks!

Ryan Bahlous-Boldi@RyanBoldi

VPO boosts performance on hard coding tasks, and is a better backbone for test-time discovery algorithms like AlphaEvolve

Ryan Bahlous-Boldi@RyanBoldi

Methods like GRPO lead to entropy collapse, where models robotically lose output diversity. But inference scaling methods like AlphaEvolve & Best-of-N with task-specific rewards only work if the model produces rich candidates to select from! So far, post-training ignores this.

Ryan Bahlous-Boldi@RyanBoldi

Across reasoning, coding, navigation and tool use domains, VPO substantially improves best@k as inference time search scales.
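For readers unfamiliar with the metric, the sketch below assumes best@k means the highest task reward among k sampled candidates; the thread does not define it, so treat that reading as an assumption. The reward values are made up. It illustrates why a pool of near-identical candidates gains nothing from a larger search budget, while a varied pool can.

```python
def best_at_k(candidate_rewards, k):
    """Highest reward among the first k sampled candidates (assumed definition)."""
    if k < 1 or k > len(candidate_rewards):
        raise ValueError("k must be between 1 and the number of candidates")
    return max(candidate_rewards[:k])


# Hypothetical task rewards for four samples from two differently trained models.
collapsed_pool = [0.6, 0.6, 0.6, 0.6]  # every sample is essentially the same answer
diverse_pool = [0.4, 0.7, 0.2, 0.9]    # samples differ, some worse and some better

for k in (1, 2, 4):
    print(k, best_at_k(collapsed_pool, k), best_at_k(diverse_pool, k))
```

In this toy setup the collapsed pool is better at k = 1 but stays flat, while the diverse pool overtakes it as k grows.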

Ryan Bahlous-Boldi@RyanBoldi

Two key ingredients in VPO: [1] Random vector reward weightings, [2] In context exploration through multi-answer chains.

Instead of optimizing a single reward weighting, VPO trains models to cover different anticipated regions of reward space across a *set* of joint rollouts.
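A rough sketch of the first ingredient, random reward weightings scored over a set of answers, might look like the following. This is not the authors' implementation: the thread does not say how weightings are sampled or how set-level scores are computed, so the uniform simplex sampling, the `set_coverage_score` function, and all reward values here are assumptions made for illustration.

```python
import random


def sample_simplex_weights(num_objectives, rng):
    """Draw non-negative weights summing to 1 (uniform over the simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Weighted sum of a reward vector under one sampled weighting."""
    return sum(r * w for r, w in zip(reward_vector, weights))


def set_coverage_score(reward_vectors, num_weightings, rng):
    """Average, over random weightings, of the best scalarized reward in the set."""
    num_objectives = len(reward_vectors[0])
    total = 0.0
    for _ in range(num_weightings):
        weights = sample_simplex_weights(num_objectives, rng)
        total += max(scalarize(rv, weights) for rv in reward_vectors)
    return total / num_weightings


rng = random.Random(0)
# Hypothetical two-objective reward vectors for two answer sets.
collapsed_set = [(0.6, 0.6), (0.6, 0.6)]  # both answers make the same tradeoff
diverse_set = [(1.0, 0.2), (0.2, 1.0)]    # each answer specializes in one objective

collapsed_score = set_coverage_score(collapsed_set, 1000, rng)
diverse_score = set_coverage_score(diverse_set, 1000, rng)
print(collapsed_score, diverse_score)
```

Under this assumed score, the specialized set is rewarded because some answer in it does well for whichever weighting is drawn, whereas the collapsed set scores the same for every weighting.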

Rohan Jha@Robro612

@lateinteraction Everything is MaxSim 😮

Tom Dupuis@bellmantd

@RyanBoldi how does this relate to Rewarded Soups and other mixture-of-rewards work from @ramealexandre? See https://arxiv.org/abs/2306.04488

Omar Khattab@lateinteraction

@RyanBoldi update. i'm glad that a whopping 2000 of y'all listened to this, and you can now read the release at:

Ryan Bahlous-Boldi@RyanBoldi

Your RL post-training may be sabotaging your LLM’s test-time scaling!

Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

Jenny Shen@JennyShen056

Great work! Excited to see vector-valued rewards used for diversity in test-time search. In our recent ICML paper (https://arxiv.org/pdf/2510.01167), we explored a related direction: vectorized rewards for multi-objective alignment, preserving distinct reward dimensions to enable controllable inference-time behavior.

Amrit Singh Bedi@amritsinghbedi3

@RyanBoldi Great work!!

Hassan Hayat 🔥@TheSeaMouse

@RyanBoldi As a card-carrying member of the anti-scalar-reward liberation front, I thank you for your work

Maxime Rivest 🧙‍♂️🦙🐧@MaximeRivest

@Robro612 @lateinteraction hello
