John Thickstun asks whether the embedding model could explain suspicious MAUVE results, noting that the Eso-LM MAUVE evaluations used ModernBERT-Large embeddings rather than the gpt2-large/RoBERTa embeddings used in the original MAUVE.

Sander Dieleman replied that this is possible and that, as far as he recalls, the CDCD evaluations used whatever defaults came with the MAUVE code.

Original post

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette Hey thanks I wasn’t aware of this metric, seems nice and I want to try it out. One thing I’ll flag is that the gradient moment score is the same for different sample sizes and not sensitive to hyperparameters in that sense.

7:41 AM · May 18, 2026 · 43 Views

Emiel Hoogeboom @emiel_hoogeboom

@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner

7:43 AM · May 18, 2026 · 91 Views

Sander Dieleman @sedielem

@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁

9:30 AM · May 18, 2026 · 112 Views

John Thickstun @jwthickstun

@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.

12:49 PM · May 18, 2026 · 54 Views

Sander Dieleman @sedielem

@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.

1:17 PM · May 18, 2026 · 35 Views