1h ago

Prime Intellect's @kalomaze built an agent evaluation benchmark, finding that GPT-5.5 leads on performance while DeepSeek v4-flash dominates on cost efficiency

DeepSeek v4-flash cost just $0.0007 per rollout.

Original post

lol i think i whipped up the simplest, dumbest agent env that
a. (broadly) sorts the wheat from the chaff
b. also shows you if a model has strong overthinking tendencies by default

as is usual, v4 flash is busy paretomogging w.r.t the cost/quality frontier

9:00 PM · May 29, 2026

@kalomaze how's pro? It should be <3x as expensive and stronger

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views
4:30 AM · May 30, 2026 · 224 Views

@kalomaze interesting, this is one of the cases where Pro vs Flash are meaningfully differentiated then. What about cost?

kalomaze @kalomaze

@teortaxesTex seems to be within variance of the other True Frontier Models (important caveat: i deliberately scoped this to a short horizon problem shape where you can still notice dumber agents doing myopic shit)

4:40 AM · May 30, 2026 · 142 Views
4:44 AM · May 30, 2026 · 109 Views

V4-Flash is so good I'm worried whether they'll be able to push it even further in V4.1 (without doubling token costs at least)

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views
4:26 AM · May 30, 2026 · 1.3K Views

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

kalomaze @kalomaze

lol i think i whipped up the simplest, dumbest agent env that
a. (broadly) sorts the wheat from the chaff
b. also shows you if a model has strong overthinking tendencies by default

as is usual, v4 flash is busy paretomogging w.r.t the cost/quality frontier

4:00 AM · May 30, 2026 · 2.1K Views
4:12 AM · May 30, 2026 · 2.2K Views

@teortaxesTex Realizing a Change Has Nth Order Consequences Can't Be This G-Loaded!

kalomaze @kalomaze

cc @teortaxesTex it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny [relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]

4:12 AM · May 30, 2026 · 2.2K Views
4:20 AM · May 30, 2026 · 548 Views

@teortaxesTex seems to be within variance of the other True Frontier Models (important caveat: i deliberately scoped this to a short horizon problem shape where you can still notice dumber agents doing myopic shit)

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze how's pro? It should be <3x as expensive and stronger

4:30 AM · May 30, 2026 · 224 Views
4:40 AM · May 30, 2026 · 142 Views

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@kalomaze interesting, this is one of the cases where Pro vs Flash are meaningfully differentiated then. What about cost?

4:44 AM · May 30, 2026 · 109 Views
4:50 AM · May 30, 2026 · 138 Views

@teortaxesTex 5.4 mini might look like a regression here but i think this is probably a "doesn't default to a reasonable reasoning effort level" situation no clue if nano even does reasoning to begin with?

kalomaze @kalomaze

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

4:50 AM · May 30, 2026 · 138 Views
4:59 AM · May 30, 2026 · 12 Views

@teortaxesTex 5.4 mini might look like a regression here (chooses not to reason?) but i think this is probably a "doesn't default to a reasonable effort level" situation also; no clue if nano even does reasoning to begin with but i presume not lol?

kalomaze @kalomaze

@teortaxesTex seemingly more token efficient & also better tool call discipline (should be exactly 2 avg in the best case, qwen here in particular had some outliers that spammed redundant writes)

4:50 AM · May 30, 2026 · 138 Views
5:01 AM · May 30, 2026 · 93 Views