Study of 67 models finds LLM ensemble performance is strictly capped by a high shared co-failure rate

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

«Gains come from models failing on different questions, not from adding more models.»

Haven't read it yet but sounds right. Mixture of Models, where all models are general-purpose competing LLMs, is a cope. Just train experts, do MOPD, and then do single-model test time scaling.

Xiuyu Li@sheriyuo

The paper argues that for any ensemble whose final output must be one of the member models' answers, including routing, voting, and MoA, the accuracy is fundamentally capped by the co-failure rate β: acc ≤ 1 − β.
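A minimal sketch of the bound, assuming per-query correctness is available as a 0/1 matrix. The data in `CORRECT` is hypothetical and the function names are illustrative, not taken from the paper. It computes β, then enumerates every possible routing over the toy data to confirm that no selection policy exceeds 1 − β.

```python
from itertools import product
from typing import Sequence


def co_failure_rate(correct: Sequence[Sequence[int]]) -> float:
    """Fraction of queries that every model gets wrong (beta).

    correct[m][q] is 1 when model m answers query q correctly, else 0.
    """
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries


def selection_accuracy(correct: Sequence[Sequence[int]],
                       choice: Sequence[int]) -> float:
    """Accuracy of a policy that returns model choice[q]'s answer on query q."""
    return sum(correct[m][q] for q, m in enumerate(choice)) / len(choice)


# Hypothetical correctness matrix: 3 models (rows) x 8 queries (columns).
CORRECT = [
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0],
]

beta = co_failure_rate(CORRECT)
ceiling = 1 - beta
best_single = max(sum(row) / len(row) for row in CORRECT)

# Brute force over all 3**8 routings: the best one reaches the ceiling,
# and none can pass it, because no member is right on a co-failed query.
n_models, n_queries = len(CORRECT), len(CORRECT[0])
best_routing = max(
    selection_accuracy(CORRECT, choice)
    for choice in product(range(n_models), repeat=n_queries)
)

print(f"beta={beta}, ceiling={ceiling}, "
      f"best single={best_single}, best routing={best_routing}")
```

The sketch only models policies that select one member's answer per query, which is the class the bound is stated for. The gap between `best_single` and `ceiling` is the most any such policy could recover on this data.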

It also shows that the commonly reported mean pairwise error correlation (ρ) is insufficient to characterize β, so low correlation alone does not imply large ensemble gains.
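A toy construction (not the paper's proof) illustrates why ρ cannot pin down β. The two hypothetical error matrices below give three models identical error rates and identical pairwise correlations, all zero, yet one has no co-failures and the other co-fails on a quarter of the queries.

```python
from itertools import combinations
from typing import Sequence


def pearson(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mean_pairwise_rho(errors: Sequence[Sequence[int]]) -> float:
    """Mean Pearson correlation over all pairs of models' error indicators."""
    pairs = list(combinations(errors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def co_failure_rate(errors: Sequence[Sequence[int]]) -> float:
    """Fraction of queries on which every model errs (beta)."""
    n_queries = len(errors[0])
    return sum(all(col) for col in zip(*errors)) / n_queries


# errors[m][q] is 1 when model m gets query q wrong. Both sets are hypothetical.
# Set A: every query has an even number of failing models, never all three.
ERRORS_A = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]
# Set B: every query has an odd number of failing models, once all three.
ERRORS_B = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]

rho_a, rho_b = mean_pairwise_rho(ERRORS_A), mean_pairwise_rho(ERRORS_B)
beta_a, beta_b = co_failure_rate(ERRORS_A), co_failure_rate(ERRORS_B)
print(f"A: rho={rho_a}, beta={beta_a}")
print(f"B: rho={rho_b}, beta={beta_b}")
```

Pairwise statistics only constrain two models at a time, while β depends on the joint behaviour of all of them, so the two sets are indistinguishable by ρ but have different ceilings.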

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents https://arxiv.org/abs/2606.27288

10:42 AM · Jun 27, 2026 · 2.4K Views

DAIR.AI@dair_ai

When does combining LLMs help?

Great analysis on combining language models, measured across 67 models from 21 providers.

Any policy that routes, votes, cascades, or runs a mixture of agents and then returns one model's answer is bounded above by 1 minus beta, where beta is the fraction of queries every candidate model gets wrong.

The common justification for ensembling is diversity, usually measured as low pairwise error correlation. The paper proves that correlation cannot identify beta, so decorrelation does not establish that headroom exists. And across the 67 models, real co-failures are far more concentrated than independence-style assumptions predict.
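One way to check this on your own evaluation logs is to compare the measured β with what independent errors would imply. The sketch below assumes the simplest independence baseline, the product of per-model error rates, which may differ from the paper's exact comparison. The correctness matrix is hypothetical.

```python
from math import prod
from typing import Sequence


def observed_beta(correct: Sequence[Sequence[int]]) -> float:
    """Measured fraction of queries that every model gets wrong."""
    n_queries = len(correct[0])
    return sum(not any(col) for col in zip(*correct)) / n_queries


def independence_beta(correct: Sequence[Sequence[int]]) -> float:
    """Co-failure rate predicted if model errors were independent:
    the product of each model's individual error rate."""
    n_queries = len(correct[0])
    return prod(1 - sum(row) / n_queries for row in correct)


# Hypothetical: 3 models x 10 queries, each model 70% accurate,
# with failures concentrated on the last two queries.
CORRECT = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
]

measured = observed_beta(CORRECT)
predicted = independence_beta(CORRECT)
print(f"measured beta={measured:.3f}, independence prediction={predicted:.3f}")
```

When the measured value sits well above the independence prediction, as in this toy data, the headroom implied by an independence argument overstates what routing or voting can deliver.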

Before assuming a router or MoA setup will help, measure beta. Co-failures cluster on the answer format rather than the subject.

Paper: https://arxiv.org/abs/2606.27288

Learn to build effective AI agents in our academy: https://academy.dair.ai/

Kay@kay_myg

@dair_ai This aligns closely with my work, measuring not just model diversity but correlated failure (β), where multiple agents converge on the same blind spots despite appearing decorrelated.

keithofaptos@keithofaptos

Sort your models by intelligence first, and have your agent swarm watch for incorrect answers. Use that data to synthetically create the correct answer(s), then retrain your experts via fine-tuning. Rinse and repeat. This should improve the intelligence of your low- and medium-IQ models. Your higher-intelligence models will probably need an additional process, though. This paper shows me how to find the problems, and likely how to fix most of them. Many more papers and agent research efforts will definitely come out of this.

The AI Therapist ⚡@TheAIShrink

@dair_ai Combining beats single models when they fail differently. 67 models from 21 providers gives error variance. Route by domain, don't just vote.

Strata@ChainZenit

@dair_ai that math is honestly pretty wild to think about.

Prince does AI@princedoesai

@teortaxesTex i keep coming back to the co-failure rate

V0LYX@0xV0LYX

@dair_ai so the benefit cap is actually a known bound, not a free lunch. makes you wonder when routing actually pays off vs just picking the best single model.

keithofaptos@keithofaptos

@dair_ai https://grok.com/share/bGVnYWN5_c919fed4-7c9a-4472-84ba-a869db416db2

مازن وذكاء الآلات@Mazen_AIEx

@dair_ai I think this is the key point: the game is not more models. It is different error surfaces, stronger routing, and measuring co-failure properly. Beta matters much more than the usual rho story.

Adel Bucetta@adelbucetta

@teortaxesTex models failing on different questions is actually the real reason why ai gets stuck in local optima
