I'm really hyped about this paper, which creates contrastive preference pairs where the preferred response is conditioned on the correct prompt, and the rejected response is conditioned on either a random prompt or a prompt missing information.
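To make the pair construction concrete, here is a minimal sketch of the random-prompt variant. It is not the paper's code: `build_contrastive_pairs` and the `generate` callable (a stand-in for sampling a response from whatever model you use) are hypothetical names, and the prompt-with-missing-information variant is left out.

```python
import random
from typing import Callable, Dict, List


def build_contrastive_pairs(
    prompts: List[str],
    generate: Callable[[str], str],  # hypothetical: samples a response for a prompt
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Build (prompt, chosen, rejected) records with no extra labels or verifiers."""
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to sample a mismatched one")
    rng = random.Random(seed)
    pairs = []
    for i, prompt in enumerate(prompts):
        # Pick a different prompt to play the role of the "random prompt".
        j = rng.choice([k for k in range(len(prompts)) if k != i])
        pairs.append(
            {
                "prompt": prompt,
                "chosen": generate(prompt),        # response to the correct prompt
                "rejected": generate(prompts[j]),  # response to a mismatched prompt
            }
        )
    return pairs
```

In practice `generate` would wrap a model call; any function from prompt string to response string works for trying the sketch out.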
Fine-tuning with DPO on these pairs squeezes more performance out of already heavily fine-tuned models, on both personalization (+4-35%) and reasoning (+0-8%) benchmarks. And it comes for free, with no additional training data, labels, or verifiers.
We prove this is equivalent to maximizing the mutual information between the prompt and response under the reference policy.
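For reference, this is the textbook definition of mutual information between a prompt x and a response y, written with the response distribution taken to be the reference policy. The notation (\pi_{\mathrm{ref}}, p(x)) is an assumption for illustration, and the paper's exact statement and proof may be set up differently.

```latex
I(X;Y)
  = \mathbb{E}_{x \sim p(x),\; y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
    \left[ \log \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_{\mathrm{ref}}(y)} \right],
\qquad
\pi_{\mathrm{ref}}(y) = \mathbb{E}_{x' \sim p(x)}\!\left[ \pi_{\mathrm{ref}}(y \mid x') \right]
```

Intuitively, the ratio is large when a response is much more likely under its own prompt than under a prompt drawn at random, which is the same contrast the preference pairs encode.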