13h ago

Researchers on X discuss gradient moment score and MAUVE as alternatives to FID for generative models

Researchers on X examined evaluation metrics for generative models as alternatives or complements to FID. Key points included the gradient moment score remaining stable across different sample sizes with low sensitivity to hyperparameter choices. Participants also described MAUVE's application in developing the CDCD model, where it supplemented AR-NLL and entropy measurements, and noted its non-convexity with respect to the nucleus sampling parameter p in autoregressive settings.
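For readers unfamiliar with the nucleus sampling parameter p mentioned above, here is a minimal sketch of what it controls: at each decoding step, only the smallest set of most-likely tokens whose cumulative probability reaches p is kept, and the next token is drawn from that renormalized set. The function names `nucleus_filter` and `sample_nucleus` and the toy distribution are hypothetical illustrations; they are not taken from the thread, the CDCD work, or the MAUVE code.

```python
import random


def nucleus_filter(probs, p):
    """Return {token_index: renormalized_prob} for the top-p nucleus.

    Hypothetical helper for illustration: `probs` is a list of next-token
    probabilities, `p` is the nucleus threshold in (0, 1].
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        # Stop as soon as the kept tokens cover at least p of the mass.
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_nucleus(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy = [0.5, 0.3, 0.15, 0.05]  # assumed toy next-token distribution
    for p in (0.4, 0.7, 0.9, 1.0):
        print(p, sorted(nucleus_filter(toy, p)))
```

Smaller p truncates the tail more aggressively, so sweeping p changes the sample distribution that any downstream metric is computed on. Whether a given metric then varies smoothly with p is an empirical question that this sketch does not answer.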

Original post

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette Hey thanks I wasn’t aware of this metric, seems nice and I want to try it out. One thing I’ll flag is that the gradient moment score is the same for different sample sizes and not sensitive to hyperparameters in that sense.

7:41 AM · May 18, 2026 · 37 Views

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner

7:43 AM · May 18, 2026 · 85 Views

Sander Dieleman @sedielem

@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁

9:30 AM · May 18, 2026 · 102 Views

John Thickstun @jwthickstun

@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.

12:49 PM · May 18, 2026 · 41 Views

@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.

1:17 PM · May 18, 2026 · 28 Views