Prime Intellect's Florian Brand says running MirrorCode benchmark evaluations on advanced models will cost over $100,000 per run
Fully eliciting performance requires up to 100 million tokens
@cwolferesearch @tmkadamcz And ProgramBench (200 tasks) has reported something like 5-10K per model run on the high end (some tweet, hard to find on the spot). PB underelicits the capabilities, imo. So a proper run would be like 10-20K+ for one model. Ant likely spent 50-100K for the 4.8 figure
@xeophon @tmkadamcz thanks for sharing!!
@cwolferesearch @tmkadamcz That’s raw API costs, add the costs of hundreds of sandboxes running for hours or days on top. Small in the grand scheme of things rn, but something to consider. CPUs and RAM are also resources these days
@cwolferesearch @tmkadamcz And, last cost-based post from the same talk: Based on public information, you can calculate the cost of evals like APEX-Agents or RLI (iirc RLI has something like 20-30K in costs for the data acquisition alone)

@cwolferesearch Also, it’s so damn costly! From a talk of mine, based on @tmkadamcz's information/calculation of running MirrorCode with a bunch of models: