Sanity check: take real OWT (OpenWebText) and tile one row across the whole batch (extreme repetition). PPL barely budges: 14.0 for the tiled batch vs 14.5 for the real data. Looks fine!
GM jumps from ~0 to +7.0. Collapse caught.
The usual token-entropy check can't catch this.
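A minimal sketch of this check, assuming GPT-2 (via Hugging Face transformers) as the reference LM and a few stand-in sentences in place of real OWT rows; none of these specific choices come from the post:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                        # GPT-2 has no pad token by default
ref_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def generative_ppl(texts):
    """Perplexity of a batch of texts under the frozen reference LM."""
    enc = tok(texts, return_tensors="pt", padding=True)
    ids, mask = enc["input_ids"], enc["attention_mask"]
    labels = ids.masked_fill(mask == 0, -100)        # don't count padding in the loss
    loss = ref_lm(ids, attention_mask=mask, labels=labels).loss
    return torch.exp(loss).item()

# Stand-ins for real OpenWebText rows (use actual OWT samples in practice).
real_batch = [
    "The committee voted to delay the measure until next spring.",
    "Researchers reported a modest improvement on the benchmark.",
    "Local officials announced new funding for the transit project.",
    "The novel opens with a storm rolling in off the coast.",
]
# "Collapsed" batch: one real row tiled across the whole batch (extreme repetition).
collapsed_batch = [real_batch[0]] * len(real_batch)

print("real PPL:     ", generative_ppl(real_batch))
print("collapsed PPL:", generative_ppl(collapsed_batch))  # barely moves: PPL misses the collapse
```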
The motivation: for models without a tractable likelihood (distilled discrete diffusion, in our case), generative PPL is easy to game by sampling at low entropy: you get "better" PPL just by being more repetitive. GM instead uses the gradient of a reference LM's NLL.
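The post doesn't spell out how GM is computed beyond "gradient of a reference LM's NLL", so the sketch below shows only that ingredient, again with GPT-2 as an assumed reference; the log-ratio of gradient norms at the end is an illustrative statistic, not necessarily the actual GM definition:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
ref_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def nll_gradient(texts):
    """Gradient of the batch-mean NLL w.r.t. the reference LM's parameters, flattened."""
    enc = tok(texts, return_tensors="pt", padding=True)
    ids, mask = enc["input_ids"], enc["attention_mask"]
    labels = ids.masked_fill(mask == 0, -100)        # ignore padding in the loss
    ref_lm.zero_grad(set_to_none=True)
    ref_lm(ids, attention_mask=mask, labels=labels).loss.backward()
    return torch.cat([p.grad.flatten() for p in ref_lm.parameters() if p.grad is not None])

# Same stand-in batches as in the PPL sketch above.
real_batch = [
    "The committee voted to delay the measure until next spring.",
    "Researchers reported a modest improvement on the benchmark.",
    "Local officials announced new funding for the transit project.",
    "The novel opens with a storm rolling in off the coast.",
]
collapsed_batch = [real_batch[0]] * len(real_batch)  # one row tiled: extreme repetition

g_real = nll_gradient(real_batch)
g_collapsed = nll_gradient(collapsed_batch)

# Illustrative statistic only: per-example gradients partially cancel in a diverse
# batch but not in a tiled one, so this tends to move even when PPL does not.
print("log gradient-norm ratio:", torch.log(g_collapsed.norm() / g_real.norm()).item())
```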