13h ago

Researchers on X discuss gradient moment score and MAUVE as alternatives to FID for generative models



Researchers on X examined evaluation metrics for generative models as alternatives or complements to FID. Key points included the gradient moment score remaining stable across different sample sizes with low sensitivity to hyperparameter choices. Participants also described MAUVE's application in developing the CDCD model, where it supplemented AR-NLL and entropy measurements, and noted its non-convexity with respect to the nucleus sampling parameter p in autoregressive settings.
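The sample-size point can be illustrated with a small sketch. It is not FID, MAUVE, or the gradient moment score: it assumes a Fréchet-style distance between Gaussians with diagonal covariances, and it uses synthetic Gaussian features in place of real embeddings. The function names (`diagonal_frechet`, `sample`, `average_distance`) are hypothetical. Both samples come from the same distribution, so the true distance is zero, and any value the estimator reports comes only from finite sample size.

```python
import random
import statistics


def diagonal_frechet(xs, ys):
    """Squared Frechet distance between Gaussians fit to xs and ys.

    Assumes diagonal covariances, so the trace term reduces to a sum of
    squared gaps between per-dimension standard deviations. This is a
    simplification for illustration, not the full FID computation.
    """
    dim = len(xs[0])
    total = 0.0
    for j in range(dim):
        col_x = [row[j] for row in xs]
        col_y = [row[j] for row in ys]
        mean_gap = statistics.fmean(col_x) - statistics.fmean(col_y)
        std_gap = statistics.stdev(col_x) - statistics.stdev(col_y)
        total += mean_gap ** 2 + std_gap ** 2
    return total


def sample(rng, n, dim):
    """Draw n synthetic feature vectors from a standard normal."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def average_distance(n, dim=16, trials=20, seed=0):
    """Average the estimated distance over several same-distribution pairs."""
    rng = random.Random(seed)
    values = [
        diagonal_frechet(sample(rng, n, dim), sample(rng, n, dim))
        for _ in range(trials)
    ]
    return statistics.fmean(values)


# Same distribution on both sides: the ideal score is 0 at every sample size.
small = average_distance(n=50)
large = average_distance(n=1000)
print(f"n=50:   {small:.4f}")
print(f"n=1000: {large:.4f}")
```

Under these assumptions the reported distance shrinks as the sample grows, which is why scores of this kind are only comparable at a fixed sample size.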

Original post

#1920 Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette Hey thanks I wasn’t aware of this metric, seems nice and I want to try it out. One thing I’ll flag is that the gradient moment score is the same for different sample sizes and not sensitive to hyperparameters in that sense.

12:41 AM · May 18, 2026

#80 Sander Dieleman @sedielem

@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner

7:43 AM · May 18, 2026 · 85 Views

9:30 AM · May 18, 2026 · 102 Views

#80 Sander Dieleman @sedielem

@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.

John Thickstun @jwthickstun

@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.

12:49 PM · May 18, 2026 · 41 Views

1:17 PM · May 18, 2026 · 28 Views

