the concept of a “DPO dataset” is honestly crazy
Will Brown of Prime Intellect argues that the "DPO dataset" label is counterintuitive, sparking a debate over whether it is just a Bradley-Terry reward model dataset under another name
Story Overview
Will Brown called the "DPO dataset" label odd when the preference pairs come from somewhere other than the base model, likening it to Victor Wembanyama studying Magnus Carlsen's endgame losses to avoid making the same mistakes. He also asked why, if the completions are being generated from a base model anyway, one would not simply do RL or self-distillation instead.
Dataset Origins Shape The Workflow
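Brown clarified that training on a labeled, filtered, or surgically edited set of completions from the base model is valid off-policy RL, especially for alignment work where exploration is not the goal. He questioned the practice when the completions come from some other source, and noted that the caveats on editing are very similar to the caveats on self-distillation context.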
Same Data, Different Naming
kalomaze observed that one person's DPO dataset is another's Bradley-Terry reward model dataset. Brown replied sarcastically that he would go train qwen3.6 on the distribution of good versus bad GPT-4 answers from some 2023 paper.
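kalomaze's point can be made concrete with a small sketch. The standard Bradley-Terry reward model loss and the standard DPO loss are both a negative log-sigmoid of a "chosen minus rejected" margin, so the same (prompt, chosen, rejected) row can feed either one. The function names, the example row, and the beta default of 0.1 below are illustrative assumptions, not taken from the thread or from any particular library.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Reward-model view: scalar rewards for the two completions of one prompt.
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO view: the "reward" is beta times the policy-vs-reference
    # log-probability ratio of each completion (beta=0.1 is an assumed default).
    implicit_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    return bradley_terry_loss(implicit_chosen, implicit_rejected)

# Hypothetical row: the schema is identical for both objectives.
row = {"prompt": "2+2=", "chosen": "4", "rejected": "5"}

# Scores below are made-up numbers standing in for model outputs on `row`.
rm_loss = bradley_terry_loss(reward_chosen=1.3, reward_rejected=-0.4)
policy_loss = dpo_loss(-4.0, -8.0, ref_logp_chosen=-5.0, ref_logp_rejected=-7.0)
```

The only difference is what gets scored: an explicit reward head in one case, the policy's own log-probabilities relative to a reference model in the other.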
One reply called Brown's analogy very apt, albeit somewhat esoteric.
Most Activity
victor wembanyana studying magnus carlsen endgame losses so he can avoid making the same mistakes
the concept of a “DPO dataset” is honestly crazy
@willccbb one mans "dpo dataset" is another mans "bradley terry reward model dataset"
@kalomaze brb training qwen3.6 based on the distribution of good vs bad gpt-4 answers from some 2023 paper

but if you’re doing that anyway from a base model, why not just do RL or self-distill? bigger batch + fewer steps if you’re worried about hacks?
caveats on editing = v similar to caveats on self-distillation context. be careful + don’t expect magic if you’re pushing it that far
@willccbb dont forget to take the toxic dpo dataset and then invert the labels (to solve alignment)

@secemp9 a row in a GRPO dataset has zero completions instead of two

@willccbb DPO DATASET IS PEOPLE

@willccbb this is a very apt analogy, albeit somewhat esoteric

@willccbb @kalomaze bro thats the qwen3.7 recipe dont give it away

it’s fine + valid off-policy RL if you’re using a labeled / filtered / surgically edited (with caveats) set of completions from the base model, esp for alignment stuff where you’re not trying to explore anyway
but if the source is something else, it’s like what are you doing lol

@willccbb GRPO dataset coming right up

@willccbb it's fine a labelled synthetic dataset consisting of the chosen and rejected answers clearly labelled for the preferences is ready for ORPO ig or what are u on about? self distillation or rl(without pref opt) also works faster

@willccbb delta learning type shit

@willccbb for images tho... I think i just like saving pictures tbf.

if i generate lots of trajectories (offline) using my SFT'd model and then run it through a bigger judge model and ask to catch for mistakes and give me a "correct response" for the mistake. this will result in lots of rejected (original model), accepted (judge model) pairs. in your opinion, is this a valid way to do offline DPO?
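The pair-construction procedure described in the question above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a recommended recipe: `generate` and `judge` are hypothetical callables standing in for the SFT'd model and the bigger judge model, and the toy versions below exist only so the sketch runs.

```python
def build_pairs(prompts, generate, judge):
    """Build (prompt, chosen, rejected) rows from offline generations.

    generate(prompt) -> str            : hypothetical SFT-model sampler
    judge(prompt, response) -> str|None: hypothetical judge; returns a
        corrected response, or None when it finds no mistake.
    """
    pairs = []
    for prompt in prompts:
        original = generate(prompt)
        corrected = judge(prompt, original)
        # Skip rows where the judge found nothing to fix.
        if corrected is None or corrected == original:
            continue
        pairs.append({
            "prompt": prompt,
            "chosen": corrected,   # written by the judge model, not the policy
            "rejected": original,  # sampled from the SFT'd model
        })
    return pairs

# Toy stand-ins so the sketch is runnable; real models would go here.
def toy_generate(prompt):
    return {"2+2=": "5", "3+3=": "6"}[prompt]

def toy_judge(prompt, response):
    answers = {"2+2=": "4", "3+3=": "6"}
    return None if response == answers[prompt] else answers[prompt]

pairs = build_pairs(["2+2=", "3+3="], toy_generate, toy_judge)
```

Note that in this setup only the rejected side comes from the model being trained; the chosen side comes from a different model, which is the situation the thread is arguing about.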

@willccbb you can't take KTO out of my cold dead hands

@kalomaze @willccbb They are turning you to a bradley terry reward model dataset tonight.

@willccbb they want SimPO but they don't know it.