John Thickstun asks whether MAUVE's embedding model affects the metric's reliability, noting that the Eso-LM evaluations used ModernBERT-Large embeddings rather than the gpt2-large and RoBERTa embeddings of the original MAUVE

AI Judge changed title after evaluation, original title: "John Thickstun notes MAUVE evaluations for Eso-LM used ModernBERT-Large embeddings unlike the original GPT-2 or RoBERTa versions, raising questions about embedding model effects on metric sensitivity"

Sander Dieleman replied that, as far as he recalls, his team used whatever embedding model came with the MAUVE code by default.

Original post

Sander Dieleman@sedielem · #91 in Tech

@emiel_hoogeboom @jwthickstun @dvruette MAUVE was very useful when we did CDCD, to complement AR-NLL and entropy measurements (which we already knew to be fraught with issues at that point). But we did find that it was non-convex in nucleus p for AR, and the fact that we beat AR was nice, but a bit sus in hindsight 😁

Emiel Hoogeboom@emiel_hoogeboom

@jwthickstun @dvruette This doesn’t have to be a big problem of course, FID has this too and is pretty good and well used, thanks again and sorry for not noticing this sooner

2:30 AM · May 18, 2026 · 112 Views

Sentiment

Users note that Gradient Moment Score stability across sample sizes need not be a major issue, since the widely used FID metric shares this trait and remains effective.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views: 54 · Likes: 1 · Replies: 1

John Thickstun@jwthickstun

@sedielem @emiel_hoogeboom @dvruette That does sound sus. I wonder if the problem is the embedding model? FID mostly gets away with ancient embeddings, but I can imagine that language evals could be more sensitive. E.g., we used ModernBERT-Large for Eso-LM Mauve evals vs. gpt2-large/RoBERTa in original Mauve.

Sander Dieleman@sedielem

@jwthickstun @emiel_hoogeboom @dvruette That's definitely possible. IIRC we used whatever came with the MAUVE code by default.
