John Thickstun questions MAUVE reliability for Eso-LM after noting recent evaluations switched to ModernBERT-Large embeddings from the original GPT-2 and RoBERTa versions used in the metric's development

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette Hey thanks I wasn’t aware of this metric, seems nice and I want to try it out. One thing I’ll flag is that the gradient moment score is the same for different sample sizes and not sensitive to hyperparameters in that sense.

7:41 AM · May 18, 2026 · 43 Views

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner

7:43 AM · May 18, 2026 · 91 Views

Sander Dieleman @sedielem

@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁

9:30 AM · May 18, 2026 · 112 Views

John Thickstun @jwthickstun

@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.

12:49 PM · May 18, 2026 · 54 Views

Sander Dieleman @sedielem

@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.

1:17 PM · May 18, 2026 · 35 Views
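
The question raised in the thread, whether the embedding model can move a MAUVE score, can be illustrated with a toy. The sketch below is not the `mauve` package. It is a minimal pure-Python approximation of a MAUVE-style score (area under a divergence frontier) computed on already-quantized cluster histograms. The histograms, the function names `kl` and `mauve_like`, the scaling constant `c`, and the number of mixture weights are all assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two histograms over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mauve_like(p, q, c=5.0, steps=199):
    """Area under a MAUVE-style divergence frontier between histograms p and q.

    c is an assumed scaling constant; steps is the number of mixture weights.
    """
    curve = [(0.0, 1.0)]  # extreme point added to close the curve
    for i in range(steps, 0, -1):  # mixture weights descending through (0, 1)
        lam = i / (steps + 1)
        r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
        curve.append((math.exp(-c * kl(q, r)), math.exp(-c * kl(p, r))))
    curve.append((1.0, 0.0))  # other extreme point
    # Trapezoidal rule; x is non-decreasing along the curve as built above.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))

# Made-up cluster histograms for the same human (p) and model (q) text.
# "Fine" features separate three clusters; "coarse" features merge the last two.
p_fine, q_fine = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
p_coarse, q_coarse = [0.5, 0.5], [0.2, 0.8]

score_fine = mauve_like(p_fine, q_fine)
score_coarse = mauve_like(p_coarse, q_coarse)
print(f"fine features:   {score_fine:.3f}")
print(f"coarse features: {score_coarse:.3f}")
```

Under these assumptions the coarse feature space reports a higher score for the same pair of corpora, because merging clusters can only shrink the KL terms. That is one mechanism by which a different embedding model could shift a MAUVE-style score, not a claim about what happened in the CDCD or Eso-LM evaluations.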

