Vector Policy Optimization (VPO) is a new policy optimization method that trains a model to produce a diverse set of good answers.
Each answer in the set is strong on a different part of the reward signal.
Here is how it works:
1. Instead of giving each prompt-answer pair a single reward score, VPO assumes a vector reward.
2. Each component measures one aspect of answer quality, e.g. different test cases, criteria, reasoning steps, or user preferences.
3. The model produces multiple candidate answers for the same prompt.
4. VPO checks different ways of weighting the reward components, so different answers can be useful under different trade-offs.
5. The model is rewarded when its set contains answers that are strong under different reward weightings.
6. This encourages the model to produce a collection of answers that work well together as a search pool.
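
To make steps 1 to 6 concrete, here is a minimal sketch of one way a set-level score could be computed. It is an illustration under assumptions, not the method's actual definition: the helper names (`sample_weightings`, `scalarize`, `set_score`), the sampling scheme for weightings, and the choice to average the best weighted reward over weightings are all hypothetical.

```python
import random

def sample_weightings(num_weightings, num_components, seed=0):
    """Draw non-negative weight vectors that sum to 1.

    Assumption: normalized exponential draws; the post does not say how
    weightings are chosen.
    """
    rng = random.Random(seed)
    weightings = []
    for _ in range(num_weightings):
        raw = [rng.expovariate(1.0) for _ in range(num_components)]
        total = sum(raw)
        weightings.append([x / total for x in raw])
    return weightings

def scalarize(reward_vector, weighting):
    """Collapse a vector reward into one number under a given weighting."""
    return sum(w * r for w, r in zip(weighting, reward_vector))

def set_score(rewards, weightings):
    """Average, over weightings, of the best scalarized reward in the set.

    rewards:    one reward vector per candidate answer
    weightings: the trade-offs to check
    Assumption: 'best answer per weighting, then average' is one plausible
    way to score a set; the post does not give the formula.
    """
    best = [max(scalarize(r, w) for r in rewards) for w in weightings]
    return sum(best) / len(best)

# Two specialists cover both trade-offs; two identical generalists do not.
weightings = [[1.0, 0.0], [0.0, 1.0]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
redundant = [[0.5, 0.5], [0.5, 0.5]]
print(set_score(diverse, weightings))    # 1.0
print(set_score(redundant, weightings))  # 0.5
```

Under this scoring, a set of complementary answers beats a set of identical middling answers, which is the "search pool" behavior described above.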
Another useful feature is that VPO can also be used inside GRPO: GRPO updates the model, while VPO decides what reward GRPO should optimize. VPO scores the whole answer set by checking which answers are best under different reward weightings, then uses that score for the GRPO update.
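
The division of labor between VPO and GRPO can be sketched roughly as follows. This is a hedged illustration only. The post does not say how the set-level score is turned into per-answer rewards, so the `win_fractions` credit rule (each answer is credited with the fraction of weightings under which it is best) is an assumption made for this example. The group-relative normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage.

```python
import statistics

def scalarize(reward_vector, weighting):
    """Collapse a vector reward into one number under a given weighting."""
    return sum(w * r for w, r in zip(weighting, reward_vector))

def win_fractions(rewards, weightings):
    """Hypothetical credit rule: fraction of weightings each answer wins.

    Ties go to the earliest answer. This is an assumption, not VPO's
    documented rule.
    """
    wins = [0] * len(rewards)
    for w in weightings:
        scores = [scalarize(r, w) for r in rewards]
        wins[scores.index(max(scores))] += 1
    return [count / len(weightings) for count in wins]

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within the sampled group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Three candidate answers for one prompt, two reward components each.
rewards = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]
weightings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.2, 0.8]]

credit = win_fractions(rewards, weightings)      # the VPO-style reward
advantages = group_relative_advantages(credit)   # what GRPO would optimize
print(credit)  # [0.5, 0.5, 0.0]
```

In this toy group, the two specialists each win half of the weightings and get positive advantages, while the middling third answer never wins and gets a negative one.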



