Tech · 4h ago

Will Brown of Prime Intellect argues DPO dataset categorization is counterintuitive, sparking debate on Bradley-Terry equivalence

Story Overview

Will Brown called the 'DPO dataset' label odd when the preference pairs come from a model other than the one being trained. He likened it to a basketball star studying a chess champion's endgame losses to avoid repeating them, and suggested RL or self-distillation as alternatives when you are already generating completions from the base model.

Original post

will brown@willccbb · #573 in Tech

the concept of a “DPO dataset” is honestly crazy

12:33 AM · Jun 13, 2026 · 5.9K Views

Open Question

Dataset Origins Shape The Workflow

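Brown clarified that training on a labeled, filtered, or surgically edited set of completions from the base model is valid off-policy RL, especially for alignment work where exploration is not the goal. He questioned the approach when the completions come from some other source, and noted that the caveats on editing completions closely mirror those on self-distillation context.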
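To make the on-policy case concrete, here is a minimal sketch of turning a model's own sampled completions into chosen/rejected rows. The names `PreferencePair` and `pairs_from_own_samples`, the `source` tag, and the per-completion scalar scores are all assumptions for illustration; none of them come from the thread, and the scores could come from any labeler or filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    source: str  # hypothetical tag: which model produced both completions


def pairs_from_own_samples(prompt, completions, scores,
                           source="base-model", min_gap=0.0):
    """Pair the best- and worst-scored completions sampled from one model.

    Returns None when the score gap does not exceed `min_gap`, because a
    pair without a clear preference carries little training signal.
    """
    if len(completions) != len(scores) or len(completions) < 2:
        raise ValueError("need at least two completions with one score each")
    # Sort by score only, so ties never fall back to comparing strings.
    ranked = sorted(zip(scores, completions), key=lambda sc: sc[0])
    (low_score, worst), (high_score, best) = ranked[0], ranked[-1]
    if high_score - low_score <= min_gap:
        return None
    return PreferencePair(prompt, chosen=best, rejected=worst, source=source)


# Made-up completions and scores for a single prompt.
pair = pairs_from_own_samples(
    "What is 2 + 2?",
    completions=["4", "5", "It is 4."],
    scores=[1.0, 0.0, 0.8],
)
print(pair)
```

Because both completions in each row were sampled from the model being trained, the data stays close to that model's own distribution. Swapping in completions from a different model would produce rows of the same shape, which is exactly the case Brown questions.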

FYI

Same Data, Different Naming

kalomaze observed that one engineer's DPO dataset is another's Bradley-Terry reward model dataset, a reframing that drew a sarcastic reply from Brown about training Qwen3.6 on good-versus-bad GPT-4 answers from a 2023 paper.
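The overlap kalomaze points to can be checked numerically. The sketch below assumes the standard formulations: a Bradley-Terry loss of -log sigmoid(r_chosen - r_rejected), and a DPO loss that uses beta * (log pi - log pi_ref) as an implicit reward. The function names and the log-probability values are made up for illustration.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that `chosen` beats `rejected` under Bradley-Terry."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Per-pair DPO loss written directly from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))


# Made-up sequence log-probabilities for one (prompt, chosen, rejected) row.
logp_c, logp_r = -12.0, -15.0   # under the policy being trained
ref_c, ref_r = -13.0, -14.0     # under the frozen reference model

direct = dpo_loss(logp_c, logp_r, ref_c, ref_r)
via_bt = bradley_terry_loss(dpo_implicit_reward(logp_c, ref_c),
                            dpo_implicit_reward(logp_r, ref_r))
assert math.isclose(direct, via_bt)
```

Both losses consume the same (prompt, chosen, rejected) row. The difference is where the reward lives: a separate reward model in the Bradley-Terry case, or the policy's own log-probability ratio in DPO. That is also why the origin of the completions matters more for DPO. The `log1p(exp(...))` form is kept simple here and would overflow for very large negative margins.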

Sentiment

The one comment with sentiment praised Brown's analogy for the DPO dataset concept as very apt, while noting that it is somewhat esoteric.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.6K · Bookmarks: 2 · Likes: 18 · Replies: 2

will brown@willccbb

victor wembanyana studying magnus carlsen endgame losses so he can avoid making the same mistakes

4h1.6K182

kalomaze@kalomaze

@willccbb one mans "dpo dataset" is another mans "bradley terry reward model dataset"

3h1K182

will brown@willccbb

@kalomaze brb training qwen3.6 based on the distribution of good vs bad gpt-4 answers from some 2023 paper

3h996181

will brown@willccbb

but if you’re doing that anyway from a base model, why not just do RL or self-distill? bigger batch + fewer steps if you’re worried about hacks?

caveats on editing = v similar to caveats on self-distillation context. be careful + don’t expect magic if you’re pushing it that far

3h11631

kalomaze@kalomaze

@willccbb dont forget to take the toxic dpo dataset and then invert the labels (to solve alignment)

3h17160

will brown@willccbb

@secemp9 a row in a GRPO dataset has zero completions instead of two

4h633

meowbooks@meowbooksj

@willccbb DPO DATASET IS PEOPLE

3h413

PureTensor@puretensorai

@willccbb this is a very apt analogy, albeit somewhat esoteric

3h131

N8 Programs@N8Programs

@willccbb @kalomaze bro thats the qwen3.7 recipe dont give it away

47m34

will brown@willccbb

it’s fine + valid off-policy RL if you’re using a labeled / filtered / surgically edited (with caveats) set of completions from the base model, esp for alignment stuff where you’re not trying to explore anyway

but if the source is something else, it’s like what are you doing lol

4h30

secemp@secemp9

@willccbb GRPO dataset coming right up

4h23

josepha_mayo@josepha_mayo

@willccbb it's fine a labelled synthetic dataset consisting of the chosen and rejected answers clearly labelled for the preferences is ready for ORPO ig or what are u on about? self distillation or rl(without pref opt) also works faster

37m51

Bojan Jakimovski@Shekswess

@willccbb delta learning type shit

3h10

azul@cathode_dreams

@willccbb for images tho... I think i just like saving pictures tbf.

18m9

Pritish Mishra@pritmish

if i generate lots of trajectories (offline) using my SFT'd model and then run it through a bigger judge model and ask to catch for mistakes and give me a "correct response" for the mistake. this will result in lots of rejected (original model), accepted (judge model) pairs. in your opinion, is this a valid way to do offline DPO?

1h8

Cossale — oss/acc@XCossale

@willccbb you can't take KTO out of my cold dead hands

2h6

xeon@saymycodename

@kalomaze @willccbb They are turning you to a bradley terry reward model dataset tonight.

3h6

ShitCockaSays@batcz

@willccbb they want SimPO but they don't know it.

3h2