I'm retweeting this again, because it's important! There are a few pitfalls when evaluating diffusion language models, highlighted in these two recent blog posts: - https://patrickpynadath1.github.io/blog/eval_methodology/ - https://samacquaviva.com/projects/flow-evals/
Both are worth a read if you have an interest in this space!
Flow models are a promising alternative to autoregression. But the current standard for evaluating flow models is broken. The reported 3x improvement in 1024-step PPL since 2023 shrinks to roughly 1.1x once you control for sample entropy. (1/12)
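A rough illustration of what "controlling for sample entropy" can mean in practice: report each generation's PPL under a judge model together with an entropy measure, instead of PPL alone, since low-entropy (degenerate, repetitive) samples score deceptively low PPL. This is a minimal sketch, not either post's exact protocol; the judge model choice, the entropy proxy, and the `samples` list are my assumptions.

```python
# Minimal sketch: judge PPL paired with an entropy proxy, assuming a
# HuggingFace-style causal LM as the judge. "gpt2-large" and `samples`
# are placeholders, not anything prescribed by the linked posts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "gpt2-large"  # hypothetical judge model
tok = AutoTokenizer.from_pretrained(judge_name)
judge = AutoModelForCausalLM.from_pretrained(judge_name).eval()

@torch.no_grad()
def ppl_and_entropy(text: str):
    ids = tok(text, return_tensors="pt").input_ids
    logp = judge(ids).logits[:, :-1].log_softmax(-1)  # predicts tokens 1..N
    # Judge PPL: exp of the mean negative log-likelihood of the sample.
    nll = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1).mean()
    # Entropy proxy: mean entropy of the judge's predictive distribution
    # over the sample. Low entropy flags degenerate text that would
    # otherwise look like a PPL win.
    ent = -(logp.exp() * logp).sum(-1).mean()
    return nll.exp().item(), ent.item()

# Report PPL alongside entropy (or bin samples by entropy) so apples
# are compared to apples across models and step counts.
for s in samples:  # `samples` = generations from the model under eval
    print(ppl_and_entropy(s))
```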