Prime Intellect's @kalomaze built an agent evaluation benchmark that finds GPT-5.5 leads on performance while DeepSeek V4-Flash dominates on cost efficiency

REPLY

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze how's pro? It should be <3x as expensive and stronger

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views

4:30 AM · May 30, 2026 · 224 Views

REPLY

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze interesting, this is one of the cases where Pro vs Flash are meaningfully differentiated then. What about cost?

kalomaze @kalomaze

@teortaxesTex seems to be within variance of the other True Frontier Models (important caveat: i deliberately scoped this to a short horizon problem shape where you can still notice dumber agents doing myopic shit)

4:40 AM · May 30, 2026 · 142 Views

4:44 AM · May 30, 2026 · 109 Views

QUOTE POST

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

V4-Flash is so good I'm worried whether they'll be able to push it even further in V4.1 (without doubling token costs at least)

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views

4:26 AM · May 30, 2026 · 1.3K Views

REPLY

#836 kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

kalomaze @kalomaze

lol i think i whipped up the simplest, dumbest agent env that a. (broadly) sorts the wheat from the chaff b. also shows you if a model has strong overthinking tendencies by default. as is usual, v4 flash is busy paretomogging w.r.t. the cost/quality frontier

4:00 AM · May 30, 2026 · 2.1K Views

4:12 AM · May 30, 2026 · 2.2K Views
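For readers unfamiliar with the "cost/quality frontier" remark above: a model sits on the Pareto frontier when no other model is both at least as cheap and at least as good, and strictly better on one of the two. The sketch below is only an illustration of that idea, not the benchmark's actual code. The `Result` type, the `pareto_frontier` helper, the model names, and every number are hypothetical placeholders, not figures from the chart.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost: float     # e.g. dollars per run; lower is better
    quality: float  # e.g. task success rate; higher is better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result dominates, sorted by cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.cost <= r.cost
            and o.quality >= r.quality
            and (o.cost < r.cost or o.quality > r.quality)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical placeholder data, NOT taken from the thread or the chart.
results = [
    Result("model_a", cost=1.0, quality=0.60),
    Result("model_b", cost=2.0, quality=0.55),  # dominated by model_a
    Result("model_c", cost=5.0, quality=0.90),
    Result("model_d", cost=8.0, quality=0.85),  # dominated by model_c
]

frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this toy data the cheap model and the strong model both sit on the frontier, while the two models that cost more for less quality drop off.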

REPLY

#836 kalomaze @kalomaze

@teortaxesTex Realizing a Change Has Nth Order Consequences Can't Be This G-Loaded!

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views

4:20 AM · May 30, 2026 · 548 Views

REPLY

#836 kalomaze @kalomaze

@teortaxesTex seems to be within variance of the other True Frontier Models (important caveat: i deliberately scoped this to a short horizon problem shape where you can still notice dumber agents doing myopic shit)

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze how's pro? It should be <3x as expensive and stronger

4:30 AM · May 30, 2026 · 224 Views

4:40 AM · May 30, 2026 · 142 Views

REPLY

#836 kalomaze @kalomaze

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze interesting, this is one of the cases where Pro vs Flash are meaningfully differentiated then. What about cost?

4:44 AM · May 30, 2026 · 109 Views

4:50 AM · May 30, 2026 · 138 Views

REPLY

#836 kalomaze @kalomaze

@teortaxesTex 5.4 mini might look like a regression here but i think this is probably a "doesn't default to a reasonable reasoning effort level" situation. no clue if nano even does reasoning to begin with?

kalomaze @kalomaze

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

4:50 AM · May 30, 2026 · 138 Views

4:59 AM · May 30, 2026 · 12 Views

REPLY

#836 kalomaze @kalomaze

@teortaxesTex 5.4 mini might look like a regression here (chooses not to reason?) but i think this is probably a "doesn't default to a reasonable effort level" situation also; no clue if nano even does reasoning to begin with but i presume not lol?

kalomaze @kalomaze

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

4:50 AM · May 30, 2026 · 138 Views

5:01 AM · May 30, 2026 · 93 Views
