Tech · 30d ago

Håvard Ihle's agentic WeirdML benchmark tests push Claude Opus 4.7 and GPT-5.5 to almost 90% accuracy

Inference costs rose tenfold to $10 per run


Original post

Florian Brand#1778

Håvard Ihle@htihle

I ran Opus 4.7 and gpt-5.5 on an agentic version of WeirdML. The models improved significantly (both scored almost 90%), especially Opus (which started from a lower base).

They had full access to the training data in a sandbox, but still had to submit code 5 times to be scored like regular WeirdML.

They achieved the higher score mostly by more consistently scoring really well on each task, not (mostly) by improving the SOTA on each task. For more details, see the Agentic WeirdML page on the website (link in thread).

Håvard Ihle@htihle

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two figures. The first one shows an overview of the overall results as well as the results on individual tasks, in addition to various metadata.

The second figure shows cost vs performance and shows a clear scaling with better results for higher costs. We also have a very varied pareto frontier with 11 models from 6 different companies having the best accuracy for a given cost for at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, grok 3 mini and deepseek R1 also each represent a good chunk of the pareto frontier.

1:48 AM · May 31, 2026 · 15.9K Views
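The "pareto frontier" in the quoted WeirdML v2 post refers to models that have the best accuracy for a given cost. A minimal sketch of how such a frontier could be computed is below. The `Result` type, the `pareto_frontier` function, and all model names and numbers are hypothetical placeholders for illustration, not WeirdML data or code.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float  # cost per run (made-up numbers below)
    accuracy: float  # mean accuracy in [0, 1]


def pareto_frontier(results):
    """Keep results that beat every result costing the same or less on accuracy."""
    frontier = []
    best = float("-inf")
    # Sort by cost ascending; on a cost tie, visit the higher accuracy first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.accuracy)):
        if r.accuracy > best:
            frontier.append(r)
            best = r.accuracy
    return frontier


# Hypothetical placeholder data, NOT actual WeirdML results.
demo = [
    Result("model-a", 0.01, 0.30),
    Result("model-b", 0.05, 0.28),  # dominated by model-a (cheaper and better)
    Result("model-c", 0.10, 0.45),
    Result("model-d", 1.00, 0.44),  # dominated by model-c
    Result("model-e", 2.00, 0.60),
]
frontier_names = [r.model for r in pareto_frontier(demo)]
print(frontier_names)  # ['model-a', 'model-c', 'model-e']
```

A model that "underperforms for its cost" in this sense is one that falls below the frontier, like `model-b` and `model-d` in the toy data.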

Sentiment

Users are excited about agentic Claude Opus 4.7 and GPT-5.5 reaching almost 90% accuracy on the WeirdML benchmark. The results show strong performance and have sparked discussion of evals and token limits.

Pos 100.0% · Neg 0.0% · 2 comments with sentiment.


Posts from X

Most Activity

Views 6.3K · Bookmarks 5 · Likes 49 · Retweets 1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

OK, this somewhat rescues Opus. I obviously want to see the trend for open models


Replies (1)

Ben (no treats)@andersonbcdefg

babe wake up they saturated weirdML


Florian Brand@xeophon

@htihle amazing!! also, "fun" exploit (modern evals really are just patching all the possible exploits...)

curious: did a model reach your 20M token limit?


Florian Brand@xeophon

@htihle i like turn gating a bit more (vibes-based), followed by a token limit. worst is wall-clock time, it is just bad

and for infinite money, i like a no-regression / loop detection setup, i.e., a model may run forever unless it does not improve its output for X iterations
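The stopping rules discussed here (turn gating, a token limit, and a no-regression setup that stops once the output has not improved for X iterations) can be sketched as one loop. This is only an illustration under stated assumptions: `run_with_budget`, `step`, and `patience` are hypothetical names, `step` stands in for one agent iteration that returns a score and the tokens it used, and nothing here is taken from the WeirdML harness.

```python
def run_with_budget(step, max_turns=None, max_tokens=None, patience=3):
    """Call step() until a limit is hit.

    step() is a hypothetical callable returning (score, tokens_used).
    Returns (best_score, turns, tokens, stop_reason).
    """
    best = float("-inf")
    stale = 0   # consecutive iterations without a new best score
    turns = 0
    tokens = 0
    while True:
        # Limits are checked before each step, so the final step
        # may push the token count slightly past max_tokens.
        if max_turns is not None and turns >= max_turns:
            return best, turns, tokens, "turn_limit"
        if max_tokens is not None and tokens >= max_tokens:
            return best, turns, tokens, "token_limit"
        score, used = step()
        turns += 1
        tokens += used
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return best, turns, tokens, "no_improvement"


# Toy run with made-up scores; each step "uses" 1000 tokens.
scores = iter([0.5, 0.6, 0.6, 0.55, 0.6, 0.9])
result = run_with_budget(lambda: (next(scores), 1000), patience=3)
print(result)  # (0.6, 5, 5000, 'no_improvement')
```

The toy run also shows the trade-off of the no-regression rule: it stops after three stale iterations and never reaches the later 0.9, which a larger `patience` would have found at higher cost.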


Håvard Ihle@htihle

@xeophon Very rarely, I think it happened like 2 times or so in the whole run (two runs per 17 tasks for two models), mostly they used less than 10M, quite a few between 10 and 20M. I think I'll use token gating or something like that in the future.


Håvard Ihle@htihle

@xeophon Thanks for the input, I'll test out both token and turn gating. I'll also probably go higher than 20M for WeirdML v3, but we'll see.


Håvard Ihle@htihle

Here is the website with more details: https://htihle.github.io/agentic_weirdml.html


Florian Brand@xeophon

@htihle they obviously all have their drawbacks, but i feel like models optimize more for parallel tool calls these days vs. raw per-dollar token efficiency. maybe i can have a data-based opinion after a ton more runs on hosted evals + swe benches :)


Ali Naeimi@Ali_NT99

@htihle Would be great to see how Gemma4 31B performs in this as that’s the sota for consumer gpus…
