I ran Opus 4.7 and gpt-5.5 on an agentic version of WeirdML. Both models improved significantly over regular WeirdML, each scoring almost 90%. Opus gained the most, since it started from a lower base.
They had full access to the training data in a sandbox, but still had to submit code 5 times, and those submissions were scored like regular WeirdML.
The higher scores came mostly from scoring really well more consistently on each task, rather than from improving the SOTA on each task. For more details, see the Agentic WeirdML page on the website (link in thread).
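To make the consistency vs. SOTA distinction concrete, here is a toy sketch. The numbers are made up for illustration and are not actual WeirdML results. It also assumes a task's score is the mean over several runs, which is a simplification of the real scoring.

```python
from statistics import mean

# Hypothetical per-run accuracies on a single task (illustrative only,
# NOT real WeirdML results).
regular_runs = [0.95, 0.40, 0.93, 0.55, 0.94]  # occasional failed runs
agentic_runs = [0.95, 0.92, 0.94, 0.93, 0.95]  # consistently strong runs


def summarize(runs):
    """Return (mean over runs, best single run) for one task."""
    return mean(runs), max(runs)


regular_mean, regular_best = summarize(regular_runs)
agentic_mean, agentic_best = summarize(agentic_runs)

# The task score (mean over runs) goes up a lot...
mean_gain = agentic_mean - regular_mean
# ...while the best single result on the task does not move at all.
best_gain = agentic_best - regular_best

print(f"mean gain: {mean_gain:.3f}, best-run gain: {best_gain:.3f}")
```

In this toy case the average rises because the bad runs disappear, while the best result on the task stays the same.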