
Prime Intellect's Florian Brand says running MirrorCode benchmark evaluations on advanced models will cost over $100,000 per run

Fully eliciting performance requires up to 100 million tokens

Original post

@cwolferesearch Also, it’s so damn costly! From a talk of mine, based on @tmkadamcz's information/calculation of running MirrorCode with a bunch of models:

3:07 PM · May 30, 2026

Cameron R. Wolfe, Ph.D. @cwolferesearch

@xeophon @tmkadamcz thanks for sharing!!

10:12 PM · May 30, 2026 · 27 Views

Florian Brand @xeophon

@cwolferesearch @tmkadamcz And ProgramBench (200 tasks) has reported something like 5-10K per model run on the high end (some tweet, hard to find on the spot). PB underelicits the capabilities, imo. So a proper run would be like 10-20K+ for one model. Ant likely spent 50-100K for the 4.8 figure

10:15 PM · May 30, 2026 · 31 Views

Florian Brand @xeophon

@cwolferesearch @tmkadamcz That’s raw API costs, add the costs of hundreds of sandboxes running for hours or days on top. Small in the grand scheme of things rn, but something to consider. CPUs and RAM are also resources these days

10:16 PM · May 30, 2026 · 29 Views

Florian Brand @xeophon

@cwolferesearch @tmkadamcz And, last cost-based post from the same talk: Based on public information, you can calculate the cost of evals like APEX-Agents or RLI (iirc RLI has something like 20-30K in costs for the data acquisition alone)

10:28 PM · May 30, 2026 · 29 Views