Sanity check: take real OWT (OpenWebText) and tile one row across the whole batch (extreme repetition). PPL barely budges: 14.0 for the tiled batch vs 14.5 for the real data. Looks fine!
GM jumps from ~0 to +7.0. Collapse caught.
The usual token-entropy check can't catch this.
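A minimal sketch of this check, assuming GPT-2 (via Hugging Face transformers) as the reference LM and a few stand-in sentences in place of real OWT rows; none of these specific choices come from the post:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                        # GPT-2 has no pad token by default
ref_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def generative_ppl(texts):
    """Perplexity of a batch of texts under the frozen reference LM."""
    enc = tok(texts, return_tensors="pt", padding=True)
    ids, mask = enc["input_ids"], enc["attention_mask"]
    labels = ids.masked_fill(mask == 0, -100)        # don't count padding in the loss
    loss = ref_lm(ids, attention_mask=mask, labels=labels).loss
    return torch.exp(loss).item()

# Stand-ins for real OpenWebText rows (use actual OWT samples in practice).
real_batch = [
    "The committee voted to delay the measure until next spring.",
    "Researchers reported a modest improvement on the benchmark.",
    "Local officials announced new funding for the transit project.",
    "The novel opens with a storm rolling in off the coast.",
]
# "Collapsed" batch: one real row tiled across the whole batch (extreme repetition).
collapsed_batch = [real_batch[0]] * len(real_batch)

print("real PPL:     ", generative_ppl(real_batch))
print("collapsed PPL:", generative_ppl(collapsed_batch))  # barely moves: PPL misses the collapse
```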
The motivation: for models without a tractable likelihood (distilled discrete diffusion, in our case), generative PPL is easy to game by sampling at low entropy: you get "better" PPL just by being more repetitive. GM instead uses the gradient of a reference LM's NLL.
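The post doesn't spell out how GM is computed beyond "gradient of a reference LM's NLL", so the sketch below shows only that ingredient, again with GPT-2 as an assumed reference; the log-ratio of gradient norms at the end is an illustrative statistic, not necessarily the actual GM definition:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
ref_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def nll_gradient(texts):
    """Gradient of the batch-mean NLL w.r.t. the reference LM's parameters, flattened."""
    enc = tok(texts, return_tensors="pt", padding=True)
    ids, mask = enc["input_ids"], enc["attention_mask"]
    labels = ids.masked_fill(mask == 0, -100)        # ignore padding in the loss
    ref_lm.zero_grad(set_to_none=True)
    ref_lm(ids, attention_mask=mask, labels=labels).loss.backward()
    return torch.cat([p.grad.flatten() for p in ref_lm.parameters() if p.grad is not None])

# Same stand-in batches as in the PPL sketch above.
real_batch = [
    "The committee voted to delay the measure until next spring.",
    "Researchers reported a modest improvement on the benchmark.",
    "Local officials announced new funding for the transit project.",
    "The novel opens with a storm rolling in off the coast.",
]
collapsed_batch = [real_batch[0]] * len(real_batch)  # one row tiled: extreme repetition

g_real = nll_gradient(real_batch)
g_collapsed = nll_gradient(collapsed_batch)

# Illustrative statistic only: per-example gradients partially cancel in a diverse
# batch but not in a tiled one, so this tends to move even when PPL does not.
print("log gradient-norm ratio:", torch.log(g_collapsed.norm() / g_real.norm()).item())
```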