«Gains come from models failing on different questions, not from adding more models.»
Haven't read it yet but sounds right. Mixture of Models, where all models are general-purpose competing LLMs, is a cope. Just train experts, do MOPD, and then do single-model test-time scaling.
The paper argues that for any ensemble whose final output must be one of the member models' answers, including routing, voting, and MoA, the accuracy is fundamentally capped by the co-failure rate β: acc ≤ 1 − β.
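A minimal sketch of the ceiling, assuming β means the fraction of questions on which every member model is wrong (my reading of "co-failure rate", not checked against the paper's exact definition). The correctness matrix and the helper names `co_failure_rate` and `oracle_accuracy` are made up for illustration. A perfect router that always picks a correct member when one exists lands exactly on 1 − β, so any real router or voter that must return a member's answer can only do as well or worse.

```python
from typing import List

def co_failure_rate(correct: List[List[bool]]) -> float:
    """Fraction of questions on which every model is wrong.

    correct[m][q] is True when model m answers question q correctly.
    (Hypothetical helper; assumes beta = P(all members fail).)
    """
    n_questions = len(correct[0])
    all_fail = sum(
        1 for q in range(n_questions) if not any(row[q] for row in correct)
    )
    return all_fail / n_questions

def oracle_accuracy(correct: List[List[bool]]) -> float:
    """Accuracy of an ideal router that picks a correct member whenever one exists."""
    n_questions = len(correct[0])
    solved = sum(1 for q in range(n_questions) if any(row[q] for row in correct))
    return solved / n_questions

# Toy data: 3 models x 8 questions (illustrative, not from the paper).
correct = [
    [True,  True,  False, False, True, False, True,  False],
    [True,  False, True,  False, True, False, False, False],
    [False, True,  True,  False, True, False, True,  False],
]

beta = co_failure_rate(correct)
ceiling = 1 - beta
oracle = oracle_accuracy(correct)
best_single = max(sum(row) / len(row) for row in correct)

print(f"beta={beta}, ceiling={ceiling}, oracle={oracle}, best single={best_single}")
```

In this toy case all three models miss questions 3, 5, and 7, so no amount of routing or voting among them recovers those answers.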
It also shows that the commonly reported mean pairwise error correlation (ρ) is insufficient to characterize β, so low correlation alone does not imply large ensemble gains.
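To see why ρ alone can't pin down β, here is a hand-built toy example (not from the paper): two three-model ensembles with identical per-model error rates and identical mean pairwise error correlation, but different co-failure rates. It assumes ρ is the mean Pearson correlation between per-question error indicators and β is the fraction of questions all members miss; the names `pearson`, `mean_pairwise_corr`, and `co_failure` are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import List

def pearson(x: List[int], y: List[int]) -> float:
    """Plain Pearson correlation between two equal-length 0/1 error vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_pairwise_corr(errors: List[List[int]]) -> float:
    """Average Pearson correlation over all pairs of models."""
    pairs = list(combinations(errors, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)

def co_failure(errors: List[List[int]]) -> float:
    """Fraction of questions where every model errs (errors[m][q] == 1)."""
    n_questions = len(errors[0])
    return sum(all(col) for col in zip(*errors)) / n_questions

# errors[m][q] == 1 means model m got question q wrong.
# Ensemble A: all three models fail together on the first two questions.
ensemble_a = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
]
# Ensemble B: failures overlap only in pairs, never all three at once.
ensemble_b = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
]

rho_a, rho_b = mean_pairwise_corr(ensemble_a), mean_pairwise_corr(ensemble_b)
beta_a, beta_b = co_failure(ensemble_a), co_failure(ensemble_b)
print(f"A: rho={rho_a}, beta={beta_a}")
print(f"B: rho={rho_b}, beta={beta_b}")
```

Both ensembles have every model at 50% error and every pair overlapping on exactly two questions, so pairwise statistics can't tell them apart. The difference sits in the three-way overlap, which is what β measures.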
When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents https://arxiv.org/abs/2606.27288







