Researchers on X discuss gradient moment score and MAUVE as alternatives to FID for generative models
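Researchers on X discussed evaluation metrics for generative models as alternatives or complements to FID. One participant noted that the gradient moment score is the same for different sample sizes and is, in that sense, not sensitive to hyperparameters, adding that such sensitivity need not be a big problem since FID has it too and is still widely used. Another described using MAUVE while developing CDCD to complement AR-NLL and entropy measurements, and reported that MAUVE was non-convex in the nucleus sampling parameter p for autoregressive (AR) models, which made CDCD's win over AR look suspect in hindsight. The embedding model was raised as a possible cause: the CDCD evaluation used, as far as its author recalled, the defaults shipped with the MAUVE code, while the Eso-LM evaluations used ModernBERT-Large rather than the gpt2-large/RoBERTa of the original MAUVE.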
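For readers unfamiliar with the nucleus parameter p mentioned above, here is a minimal sketch of top-p (nucleus) filtering over a single next-token distribution. The function name `nucleus_filter` and the toy probabilities are illustrative assumptions; nothing here is taken from the thread, the CDCD code, or the MAUVE code.

```python
import numpy as np


def nucleus_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose total mass
    reaches p, zero out the rest, and renormalize.

    `probs` is a 1-D next-token distribution (hypothetical toy input).
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)             # token indices, most probable first
    cumulative = np.cumsum(probs[order])   # running probability mass
    # First position where the running mass reaches p, inclusive.
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    cutoff = min(cutoff, len(probs))       # guard against float round-off at p = 1
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()


if __name__ == "__main__":
    toy = [0.5, 0.3, 0.15, 0.05]
    # Sweeping p changes how much of the tail is allowed into samples.
    for p in (0.5, 0.7, 0.9, 1.0):
        print(p, nucleus_filter(toy, p))
```

Sweeping p as in the loop above, generating samples at each setting, and scoring each set is one way a curve of metric value against p could be produced. The thread's observation is that for MAUVE on AR models this curve was non-convex.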
@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁
@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner
@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.
@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.
@jwthickstun @dvruette Hey thanks I wasn’t aware of this metric, seems nice and I want to try it out. One thing I’ll flag is that the gradient moment score is the same for different sample sizes and not sensitive to hyperparameters in that sense.