Sanity check: take real OWT, tile one row across the batch (extreme repetition). PPL barely budges: 14.0 vs 14.5 for real data. Looks fine! GM jumps from ~0 to +7.0. Collapse caught. The commonly used token entropy cannot catch this.
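A minimal sketch of the tiling construction, to show why a per-sequence statistic is blind to it. This is not the GM or PPL computation. The random token ids stand in for tokenized OWT, `token_entropy` is a hypothetical helper, and reading "token entropy" as per-sequence unigram entropy is an assumption.

```python
import numpy as np

def token_entropy(row):
    """Shannon entropy (nats) of the empirical token distribution within one sequence."""
    _, counts = np.unique(row, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# Stand-in for a batch of tokenized real text: 8 sequences of 64 token ids (made-up data).
real = rng.integers(0, 1000, size=(8, 64))
# Extreme repetition: copy the first row across the whole batch.
tiled = np.tile(real[0], (real.shape[0], 1))

# Per-sequence statistic: every tiled row scores exactly like the real row it came from.
real_entropy = np.mean([token_entropy(r) for r in real])
tiled_entropy = np.mean([token_entropy(r) for r in tiled])

# Batch-level view: the number of distinct rows collapses to 1.
distinct_real = len(np.unique(real, axis=0))
distinct_tiled = len(np.unique(tiled, axis=0))

print(f"mean token entropy: real={real_entropy:.3f} tiled={tiled_entropy:.3f}")
print(f"distinct rows:      real={distinct_real} tiled={distinct_tiled}")
```

Each tiled row is still a real sequence, so anything averaged per row looks healthy. Only a statistic that compares rows against each other can see the collapse.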
Less extreme: a top-p (nucleus) sweep on a small AR LM. PPL drops monotonically as you tighten p.
GM is U-shaped, with a minimum near p=0.90. As p drops below that, mode collapse pushes GM back up.
Note that GM can still be gamed; it's just more difficult.
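For reference, a sketch of the standard top-p (nucleus) truncation that such a sweep varies, written in NumPy under the assumption that `probs` is a single next-token distribution. The function name `top_p_filter` and the toy distribution are illustrative, not taken from the experiment.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose mass reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cum = np.cumsum(probs[order])
    # First position where cumulative mass >= p; keep everything up to and including it.
    k = min(int(np.searchsorted(cum, p)) + 1, len(probs))
    kept = order[:k]
    out = np.zeros_like(probs)
    out[kept] = probs[kept]
    return out / out.sum()

# Toy next-token distribution (made-up numbers).
probs = np.array([0.5, 0.3, 0.15, 0.05])
for p in (1.0, 0.9, 0.6, 0.4):
    filtered = top_p_filter(probs, p)
    print(p, np.count_nonzero(filtered), np.round(filtered, 3))
```

Tightening p shrinks the support the sampler can draw from, which is why samples get more likely under the scoring model (lower PPL) while becoming less diverse.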