DeepSeek Scores Strongly In Tool Calling Benchmarks Against Opus 4.8

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

> Here is my untested hypothesis: Opus 4.8 would score higher if we gave it more tool calling budget

A reasonable guess. V4 is good when you don't need a lot of tool calls, but can't really strategize for massive sessions. Ironically, this might mean that V4.1 will be "worse".

Toven @pingToven

valid call out! we are gonna try to get some third party data points ASAP.

added context from our PM who ran the benchmarks (note that all models ran in the same server-side tool calling setup)

“I am floored by how well DeepSeek scored.

Here is my untested hypothesis: Opus 4.8 would score higher if we gave it more tool calling budget. I think it's hungrier and performs better with a long time and a lot of tool use.

Fable seemed way better at using the tool call budget judiciously and thinking for much longer.

We needed those budgets because the fusion calls don't run in a true long-running harness. If we ran the benchmarks in a managed-agents-style environment, I bet Opus 4.8 would easily surpass DeepSeek, both in score and spend.”

11:47 PM · Jun 13, 2026 · 2.3K Views

Posts from X

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

This is the new version of CoT length creep: tool call intensity. Top-tier models, once again, will be better at keeping it short and to the point, while the catch-up team will compensate in RL by exploding the budget. I hope to be wrong.

月gate白噪漂移85返 @ProSportNews247

@teortaxesTex The tool call cap really does limit its performance ceiling.
