Analysis finds Claude Sonnet 4.5, DeepSeek R1, Grok 4, and GPT-5 exhibit highly correlated error patterns on benchmarks

Story Overview

A new benchmark called SDE evaluates LLMs on real scientific discovery projects in biology, chemistry, materials science, and physics, moving past static knowledge tests. The study finds Claude Sonnet 4.5, DeepSeek R1, Grok 4, and GPT-5 produce nearly identical sequences of correct and incorrect answers on the same questions, with performance gaps that do not close through simple scaling.

Original post

Nando de Freitas@NandoDF

When everyone uses the same evals, data, distillation and vendors to train LLMs.

Courtesy of: https://arxiv.org/abs/2512.15567

1:25 AM · Jun 15, 2026 · 2.4K Views

Open Question

Correlated failures point to training overlap

Side-by-side plots show the models succeeding or stumbling on matching question indices, a pattern the paper links to systematic weaknesses shared across providers. Causes such as common datasets or distillation steps are noted in discussions but not quantified in the reported results.
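One way to put a number on "succeeding or stumbling on matching question indices" is to correlate per-question correctness across models. The sketch below is illustrative only and is not the paper's method: the function names, the model labels, and the toy 0/1 data are all assumptions made up for the example.

```python
from itertools import combinations
from math import sqrt


def phi(a, b):
    """Pearson correlation of two equal-length 0/1 sequences (the phi coefficient)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A model that is always right or always wrong has no variance to correlate.
        return float("nan")
    return cov / sqrt(var_a * var_b)


def pairwise_phi(results):
    """Map each pair of model names to the phi coefficient of their correctness vectors."""
    return {
        (m1, m2): phi(results[m1], results[m2])
        for m1, m2 in combinations(sorted(results), 2)
    }


def shared_failures(results):
    """Return the question indices that every model answered incorrectly."""
    columns = zip(*results.values())
    return [i for i, col in enumerate(columns) if not any(col)]


# Hypothetical data: 1 = correct, 0 = incorrect, one entry per question index.
toy = {
    "model_a": [1, 1, 0, 0, 1, 0],
    "model_b": [1, 1, 0, 0, 1, 1],
    "model_c": [1, 0, 0, 0, 1, 0],
}

if __name__ == "__main__":
    for pair, value in pairwise_phi(toy).items():
        print(pair, round(value, 3))
    print("failed by all models:", shared_failures(toy))
```

A high pairwise coefficient only says the models agree on which questions are hard. It cannot by itself separate shared training data from intrinsically harder questions, which is the "question complexity" point raised in the replies.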

Benchmark Shift

Project-level tests expose scenario-specific gaps

Unlike recall-focused benchmarks, SDE requires models to generate hypotheses, run simulations, and interpret iterative results, where no single model leads across all domains. This variation suggests current approaches still rely on guided exploration rather than autonomous discovery.

Sentiment

Commenters respond positively to the finding that shared evals homogenize LLM response patterns: one says the result fits their world model, and another notes that the most interesting details are hidden in the appendix.

Positive: 100.0%

Negative: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 170 · Likes: 1 · Replies: 1

Markus Wulfmeier@m_wulfmeier

@NandoDF + question complexity

Toni Kukurin@tkukurin

@m_wulfmeier @NandoDF presumably heavily interacts with "the same data" point from @NandoDF :)

that said, I do think fig 7 hints at "question complexity" on aggregate across domains (if you take "reasoning effort" as given)

tsunami_crypto@ls_brd

@NandoDF wait so all the red dots are from different models failing the same questions?

almost like the data contamination pool is one body of water

Martian@space_colonist

@NandoDF this fits my world model

GEE!@KhieEmm

@NandoDF This is what peak performance looks like.

Toni Kukurin@tkukurin

@m_wulfmeier @NandoDF truly interesting bits are always hidden away in the appendix
