Hugging Face CEO Clément Delangue argues AI benchmarks favor closed-source APIs that use hidden routing and fallback techniques

Story Overview

Hugging Face CEO Clément Delangue highlighted an Artificial Analysis chart where closed APIs topped the Intelligence Index by combining multiple models through undisclosed routing and fallbacks, framing the setup as an uneven comparison against single open models.

Original post

clem 🤗@ClementDelangue

This graph captures what’s broken about AI evals: they structurally favor closed-source APIs that can route, fallback, ensemble, and optimize behind the scenes with no transparency.

No offense, @ArtificialAnlys, but how is comparing one model to two models fair?

7:06 AM · Jun 12, 2026 · 10.9K Views

Open Question

Benchmarks target full services

Replies in the thread stressed that the evaluation measures end-to-end API performance rather than isolated models. That is why Artificial Analysis lists the entry as "Fable with fallback": the label is explicit, yet the entry still draws scrutiny over hidden optimizations.

Evaluation Fairness

Unclear effects beyond labels

No data in the discussion quantifies how much undisclosed techniques lift scores outside the explicitly marked fallback cases, leaving the practical fairness of these rankings unresolved.

Sentiment

Many users dismissed AI benchmarks as rigged and compromised because they pit single models against multi-system routing setups rather than evaluating standalone performance.

Positive: 0.0% · Negative: 100.0% (12 comments with sentiment)

Posts from X

Andreas Kirsch 🇺🇦@BlackHC

@ClementDelangue @ArtificialAnlys It doesn't compare models but APIs/services

Also in this specific instance two models is obviously weaker than the single Mythos model? 🌝

Jake@JakeKAllDay

The fallback to Opus 4.8 represents a nerf on fable. But their notes show it’s <10% of the time and the score delta between the two models means the impact has to be <1 point bearish given the weight averages.

It’s the real world routing decisions from Ant, so they earned the lower score.

clem 🤗@ClementDelangue

@JakeKAllDay @ArtificialAnlys I'm going to route my failing requests to fable and take the lead of the leaderboard...

Moonlit Monkey@MoonlitMonkey69

@JakeKAllDay @ClementDelangue @ArtificialAnlys If it's nerfed, which documents I believe suggested it was? then a fallback may advantage it over a native answer.

Harrison Kinsley@Sentdex

@ClementDelangue @ArtificialAnlys This was a very frustrating concept to convey back when I was working on LLM training. Everyone wants to compare open models to OAI/Anthropic on benchmarks, but that's comparing a raw model to a model+harness. That's why I pay more attn to benches w/ harnesses.

Micah Hill-Smith@_micah_h

@ClementDelangue @ArtificialAnlys Hey @ClementDelangue - we're showing "Fable with fallback" on the leaderboard, with clear labelling, on the basis that it's the product that Anthropic is offering.

We're working on further analysis of Claude Fable and will have more breakdowns to show soon.

clem 🤗@ClementDelangue

@MeryemArik9 @ArtificialAnlys yes maybe we should evaluate omni @victormustar or https://openrouter.ai/docs/guides/routing/provider-selection @alexatallah? Or anyone want to built a routing system and evaluate on AA?

Micah Hill-Smith@_micah_h

@ClementDelangue @ArtificialAnlys We disclosed in our launch coverage that we saw fallback routing in ~8% of queries:

We very much agree that the results with and without the fallback routing are interesting!

Meryem Arik@MeryemArik9

@ClementDelangue @ArtificialAnlys Systems become more important than models? Would be worth AA benchmarking teams that are building ensemble systems of OS models

David Hariri@davidhariri

@ClementDelangue @ArtificialAnlys Hm. Spiritually I want to agree, but if you can measure both by tok/s, $/1M input/output and the input and output is functionally equivalent, isn't it just up to the reader to make a latency:cost:intelligence decision?

Jake@JakeKAllDay

@ClementDelangue @ArtificialAnlys I'm genuinely trying to understand the objection, do you think the routing to opus gives some sort of advantage?

Anthropic has a *ton* of fuckery in their 1p marketing benches (picking the better score b/w fable + mythos in one column 🙄) but this just seems like estimate noise?

clem 🤗@ClementDelangue

@_micah_h @ArtificialAnlys also would be interested what would be the results with 0 for the answers fable refuses to answer

clem 🤗@ClementDelangue

@_micah_h @ArtificialAnlys wow 8% is a lot! is there a leaderboard with the % of refusal, that'd be cool!

Jake@JakeKAllDay

there are ways for companies to game benchmarks through variable serving, but the question is if that variable serving is reflect of the real world service an end user gets.

If oAI comes up with 50 different instances of 5.6 and routes to different versions of the models based on topic triggers to maximize performance for the user, as long as that is the experience I get on average, that is the experience that should be benchmarked, even if its 50 models in a trench coat. If MoE norms move toward hierarchical MoEs vs the current low level paradigm, its more or less the same idea.

Whatever noise is injected into benchmarks by fallbacks, I've seen no evidence that it is remotely as meaningful as noise from benchmaxxing + contamination

Oleksii Kuchaiev@kuchaev

@ClementDelangue @ArtificialAnlys Agree, if the model failed to answer a question via refusal it should count as fail @ArtificialAnlys
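
The proposal in these replies, counting a refusal as a failed answer rather than leaving it out, can be illustrated with a small sketch. Everything below is hypothetical: the `Result` type, both scoring functions, and the sample of 100 questions are made up for illustration and do not reflect Artificial Analysis's actual methodology or any published numbers.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One benchmark question outcome (hypothetical structure)."""
    correct: bool
    refused: bool = False


def score_excluding_refusals(results: list[Result]) -> float:
    """Percent correct over answered questions only; refusals are dropped."""
    answered = [r for r in results if not r.refused]
    if not answered:
        return 0.0
    return 100 * sum(r.correct for r in answered) / len(answered)


def score_refusals_as_zero(results: list[Result]) -> float:
    """Percent correct over all questions; a refusal counts as a wrong answer."""
    if not results:
        return 0.0
    return 100 * sum(r.correct and not r.refused for r in results) / len(results)


# Made-up sample: 100 questions, 8 refused, 69 of the 92 answered are correct.
sample = (
    [Result(correct=True)] * 69
    + [Result(correct=False)] * 23
    + [Result(correct=False, refused=True)] * 8
)

excluding = score_excluding_refusals(sample)
as_zero = score_refusals_as_zero(sample)
print(f"refusals excluded: {excluding:.1f}, refusals as zero: {as_zero:.1f}")
```

In this made-up sample the two conventions differ by six points, which is why the choice of convention matters for a leaderboard even when the underlying answers are identical.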

Jake@JakeKAllDay

Opus is a smaller, less capable model. The score you are seeing is a ~5-10% weighted average effectively of Opus performance. But basically across the board opus performs within 10% of the scores Fable on all but a few areas (eg cyberSec), so the score difference we're talking about here is ~0.6 points. Within any reasonable margin of error.

If the premise is that they're cherry picking and Opus is better at those areas, no, there's absolutely no evidence to support that.

Anthropic still has access to those topics/areas where Fable outperforms Opus, they just don't share it externally/outside select partners. There's already jailbreak examples of mythos operating in its 'prohibited' areas where its highly performant (relative to opus), but of course those dont belong in a bench since they're not the actually served product.
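
The weighted-average reasoning in this reply can be checked with a short sketch. The ~8% fallback rate and the "within 10%" gap come from the thread; the primary score of 75.0 is a hypothetical placeholder, and the function names are illustrative. The sketch also assumes that queries sent to the fallback are otherwise typical, which the thread does not establish.

```python
def blended_score(primary: float, fallback: float, fallback_rate: float) -> float:
    """Expected score when a fraction of queries is answered by a fallback model."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return (1.0 - fallback_rate) * primary + fallback_rate * fallback


def fallback_penalty(primary: float, fallback: float, fallback_rate: float) -> float:
    """Points lost relative to the primary model answering every query."""
    return primary - blended_score(primary, fallback, fallback_rate)


# Hypothetical primary score; the thread does not state the actual value.
primary_score = 75.0
# Fallback model assumed to score 10% lower, per the "within 10%" claim.
fallback_score = primary_score * 0.90
# Fallback rate of ~8%, as disclosed in the thread.
rate = 0.08

penalty = fallback_penalty(primary_score, fallback_score, rate)
print(f"estimated penalty: {penalty:.2f} points")
```

Under these assumptions the penalty equals the fallback rate times the score gap, so it stays below one point unless the gap or the rate is much larger than the thread suggests.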

Micah Hill-Smith@_micah_h

@ClementDelangue @ArtificialAnlys Agree on both - should have charts of these next week!

Matt Gibson@MattGibsonMusic

@ClementDelangue @ArtificialAnlys One name on the chart. Three systems in the basement. That's not a leaderboard — that's the definition of 'rigged'...

Strata@ChainZenit

@ClementDelangue @ArtificialAnlys that gap in the evaluation logic is actually wild.

Prashanth (Manohar) Velidandi@PMV_InferX

@ClementDelangue @ArtificialAnlys This whole evals by these companies feels compromised.
