Great work on Vector Policy Optimization (VPO).
The standard scalar-reward view of post-training is inherently lossy: compressing a trajectory into one number discards a lot of useful structure, such as which sub-goals were met, where the reasoning failed, and what tradeoffs were made.
VPO makes this point concretely by using vector-valued rewards to train models that cover different regions of the reward space. This is especially relevant as inference-time search becomes more important where diversity is not just beneficial but directly useful.
This also connects to the success of methods like GEPA: richer feedback representations, whether vector-valued or semantic/natural-language, can carry much more optimization signal than sparse scalar rewards. More broadly, we may need to rethink “reward” as a structured feedback object rather than a single compressed scalar.