I ran Opus 4.7 and gpt-5.5 on an agentic version of WeirdML. The models improved significantly (both scored almost 90%), especially Opus (which started from a lower base).
They had full access to the training data in a sandbox but, as in regular WeirdML, still had to submit code 5 times to be scored.
They achieved the higher score mainly by scoring really well more consistently on each task, rather than by improving the SOTA on each task. For more details, see the Agentic WeirdML page on the website (link in thread).
WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6) and results from all the latest models. We now also track API costs and other metadata, which give more insight into the different models. The new results are shown in these two figures. The first one gives an overview of the overall results and the results on individual tasks, along with various metadata.
The second figure plots cost against performance and shows clear scaling, with better results at higher costs. We also have a very varied Pareto frontier: 11 models from 6 different companies have the best accuracy for a given cost over at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, Grok 3 mini and DeepSeek R1 also each represent a good chunk of the Pareto frontier.
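For readers who want to see what "on the Pareto frontier" means here, below is a minimal Python sketch. It is not the code used for WeirdML. The `ModelResult` type, the `pareto_frontier` function, the model names and all the numbers are made up for illustration. It assumes each model is summarized by a single cost and a single accuracy value, and keeps a model only if no cheaper (or equally cheap) model is at least as accurate.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str        # placeholder model name, not a real model
    cost: float      # assumed: average cost per run, in USD
    accuracy: float  # assumed: mean accuracy in [0, 1]


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Return the models not dominated on (lower cost, higher accuracy)."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, the more
    # accurate model comes first so the less accurate one is dropped.
    for result in sorted(results, key=lambda r: (r.cost, -r.accuracy)):
        # A model stays only if it beats every cheaper model on accuracy.
        if result.accuracy > best_accuracy:
            frontier.append(result)
            best_accuracy = result.accuracy
    return frontier


# Made-up numbers, purely illustrative.
results = [
    ModelResult("model-a", 0.01, 0.40),
    ModelResult("model-b", 0.05, 0.35),  # pricier and worse than model-a
    ModelResult("model-c", 0.10, 0.55),
    ModelResult("model-d", 1.00, 0.50),  # pricier and worse than model-c
    ModelResult("model-e", 2.00, 0.70),
]

frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this toy data, model-b and model-d are the ones that underperform for their cost, while each of the other three is the best choice over some part of the cost range.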


