When everyone uses the same evals, data, distillation and vendors to train LLMs.
Courtesy of: https://arxiv.org/abs/2512.15567
A new benchmark called SDE evaluates LLMs on real scientific discovery projects in biology, chemistry, materials science, and physics, moving past static knowledge tests. The study finds Claude Sonnet 4.5, DeepSeek R1, Grok 4, and GPT-5 produce nearly identical sequences of correct and incorrect answers on the same questions, with performance gaps that do not close through simple scaling.
Side-by-side plots show the models succeeding or stumbling on matching question indices, a pattern the paper links to systematic weaknesses shared across providers. Causes such as common datasets or distillation steps are noted in discussions but not quantified in the reported results.
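One way to put a number on "succeeding or stumbling on matching question indices" is to compare per-question correctness vectors between models. The sketch below is only an illustration, not the paper's analysis: the model names and results are made up, and the two metrics (simple agreement and Jaccard overlap of failure sets) are assumptions chosen for clarity.

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of questions where two models are both right or both wrong."""
    if len(a) != len(b):
        raise ValueError("correctness vectors must cover the same questions")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shared_failure_jaccard(a, b):
    """Jaccard overlap of the sets of question indices each model got wrong."""
    fail_a = {i for i, ok in enumerate(a) if not ok}
    fail_b = {i for i, ok in enumerate(b) if not ok}
    union = fail_a | fail_b
    # Two models with no failures at all overlap trivially.
    return len(fail_a & fail_b) / len(union) if union else 1.0


def pairwise(results, metric):
    """Apply a metric to every unordered pair of models."""
    return {
        (m, n): metric(results[m], results[n])
        for m, n in combinations(sorted(results), 2)
    }


# Hypothetical data: True = correct, False = incorrect, one entry per question.
results = {
    "model_a": [True, True, False, True, False, False, True, True],
    "model_b": [True, True, False, True, False, True, True, True],
    "model_c": [True, False, False, True, False, False, True, True],
}

for pair, score in pairwise(results, shared_failure_jaccard).items():
    print(pair, round(score, 3))
```

Agreement alone can look high simply because every model gets most questions right, so the failure-set overlap is the more telling of the two numbers when accuracy is high.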
Unlike recall-focused benchmarks, SDE requires models to generate hypotheses, run simulations, and interpret iterative results, where no single model leads across all domains. This variation suggests current approaches still rely on guided exploration rather than autonomous discovery.
Replies are largely positive about the finding that shared evals homogenize LLM response patterns: commenters say the results match their world model, and some point to interesting details hidden in the appendix.
@NandoDF + question complexity

@m_wulfmeier @NandoDF presumably heavily interacts with "the same data" point from @NandoDF :)
that said, I do think fig 7 hints at "question complexity" on aggregate across domains (if you take "reasoning effort" as given)

@NandoDF wait so all the red dots are from different models failing the same questions?
almost like the data contamination pool is one body of water

@NandoDF this fits my world model

@NandoDF This is what peak performance looks like.

@m_wulfmeier @NandoDF truly interesting bits are always hidden away in the appendix