cross entropy reduces NLL error, but reducing NLL error =/= reducing sampling error. as soon as you sample from the tail (mass not explicitly punished by improved NLL), you're off-manifold.
i achieved the latter pic's improvement on a qwen w/o any direct SFT at all, via a GAN-like loop
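a toy sketch of the NLL vs sampling error gap, not the actual qwen setup: the 4-token vocab, both model distributions, and the helper names are made up for illustration, and steps are treated as independent draws (a real LM conditions on its own samples, which likely makes the drift worse).

```python
import math

def nll(p, q):
    """expected negative log-likelihood of model q under data distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def off_support_mass(p, q):
    """probability that one sample from q lands where the data has zero mass."""
    return sum(qi for pi, qi in zip(p, q) if pi == 0)

def on_manifold_prob(p, q, steps):
    """chance that `steps` independent samples from q all stay on the data support."""
    return (1.0 - off_support_mass(p, q)) ** steps

# hypothetical 4-token vocab; the data only ever uses the first two tokens
p = [0.5, 0.5, 0.0, 0.0]
model_a = [0.90, 0.09, 0.005, 0.005]  # badly calibrated on-support, thin tail
model_b = [0.45, 0.45, 0.05, 0.05]    # well calibrated on-support, fat tail

for name, q in [("a", model_a), ("b", model_b)]:
    print(name, round(nll(p, q), 3), off_support_mass(p, q), on_manifold_prob(p, q, 50))
```

model_b wins on NLL but puts roughly ten times more mass off-support per token, so over a 50-token rollout it almost never stays on-manifold while model_a usually does.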
my intuition that "NTP distillation is fairly weak; GAN-like macro approximation of subsequences via RL / density ratio estimation is strong; nobody has executed the latter well at scale so far" is something i may have to flesh out into a concrete research program
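for reference, a minimal sketch of the density ratio estimation piece, under heavy assumptions: single tokens instead of subsequences, exact expectations instead of sampled batches, and made-up distributions and names (`fit_discriminator`, `p_data`, `p_model`). it only illustrates the standard fact that an optimal real-vs-model discriminator has logit log(p_data / p_model), which is the kind of signal an RL loop could use as reward. it is not the loop described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_discriminator(p, q, lr=2.0, steps=5000):
    """fit one logit per token by gradient descent on the expected
    discriminator loss  -sum_i [ p_i*log sigmoid(w_i) + q_i*log sigmoid(-w_i) ],
    where p is the "real" distribution and q is the "model" distribution."""
    w = [0.0] * len(p)
    for _ in range(steps):
        for i, (pi, qi) in enumerate(zip(p, q)):
            # derivative of the loss w.r.t. w_i
            grad = sigmoid(w[i]) * (pi + qi) - pi
            w[i] -= lr * grad
    return w

# hypothetical distributions; the model over-weights the last two tokens
p_data = [0.5, 0.4, 0.08, 0.02]
p_model = [0.4, 0.3, 0.2, 0.1]

logits = fit_discriminator(p_data, p_model)
true_log_ratio = [math.log(a / b) for a, b in zip(p_data, p_model)]

for w, r in zip(logits, true_log_ratio):
    print(round(w, 4), round(r, 4))
```

the learned logits line up with the true log ratios: negative where the model over-samples, positive where it under-samples.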
