> Here is my untested hypothesis: Opus 4.8 would score higher if we gave it more tool calling budget

A reasonable guess. V4 is good when you don't need a lot of tool calls, but can't really strategize for massive sessions. Ironically, this might mean that V4.1 will be "worse"
valid call out! we are gonna try to get some third-party data points ASAP.
added context from our PM who ran the benchmarks (note that all models ran in the same server-side tool calling setup)
“I am floored by how well DeepSeek scored.
Here is my untested hypothesis: Opus 4.8 would score higher if we gave it more tool calling budget. I think it's hungrier and performs better with a long time and a lot of tool use.
Fable seemed way better at using the tool call budget judiciously and thinking for much longer.
We needed those budgets because the fusion calls don't run in a true long-running harness. If we ran the benchmarks in a managed agents style environment, I bet Opus 4.8 would easily surpass DeepSeek, both in score and spend.”
