In WeirdML we see Opus models increasingly using submissions just to explore the data, without actually trying to solve the problem (no predictions for the test set).
It seems like, with no or low thinking, at least for some tasks, the prior for Opus to just explore the data overrides the system instructions saying that the score on every submission counts and will be compared to other models.
Opus does not feel the rush to maximize the score quickly; it simply tries to understand the data. While this kind of attitude is great in Claude Code, it costs Opus a lot on an eval like WeirdML, where getting the most out of every submission is important.
Exploring the data is definitely crucial in WeirdML, and GPT and Gemini also do this (printing out a bunch of info, testing hypotheses about the data, etc.). They just also make their best effort to solve the task at every iteration, which means that they both get more info from each submission and get a valid score.
Opus, somehow, does not feel the same urgency in this eval. I'm not exactly sure what this means, and more thinking does make Opus use the submissions more effectively, but it's at least an interesting finding.