Creator Teortaxes says frontier AI labs spend up to $15 billion each on highly inefficient training data
A browser-use dataset for SAP reportedly costs $500,000.
> "Really good long horizon tasks go up to $20,000 each. A complete browser-use version of SAP was rumored at $500,000." I think this points to market inefficiency.
@ChrisPainterYup @hamandcheese what do you mean? just reverse engineer and decompile steam games
It's been really hard for METR to find vendors that can sell us long-horizon hard tasks that we can actually use.
Secret sauce of frontier models.
@ChrisPainterYup did you buy some?
@ChrisPainterYup What's the limiting factor?
If this is true, then I really hope we have some better public evaluation benchmarks that are released as a result of this spending. Even top benchmarks for coding agents are usually super small. For example:
- DeepSWE has 113 tasks.
- TerminalBench-2.0 has 89 tasks.
- SWE-EVO has 48 tasks.
- SWE-Bench-Verified has 500 tasks.
Exceptions exist, like SWE-Bench Pro and LiveCodeBench, which have >1K tasks (although LiveCodeBench is usually evaluated over a small subset of <200 examples). However, the fact that important benchmarks are usually so small leaves a lot of room for noise, which skews both the measurement and the interpretation of eval results. It is very difficult to apply meaningful error bars or any other form of uncertainty estimation in this scenario.
Investing in better public benchmarks would likely take a tiny fraction of this spend but would have an outsized impact on the overall trajectory of coding agents.
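
To make the noise point concrete, here is a rough sketch of how wide a 95% confidence interval on a pass rate would be at the task counts quoted above. It uses the standard Wilson score interval for a binomial proportion. The 50% pass rate is a hypothetical worst case picked for illustration, not a reported result, and the calculation assumes tasks are independent and equally weighted and that each is scored once, which real benchmarks may not satisfy.

```python
import math

# Task counts as quoted in the post above.
BENCHMARK_SIZES = {
    "DeepSWE": 113,
    "TerminalBench-2.0": 89,
    "SWE-EVO": 48,
    "SWE-Bench-Verified": 500,
}


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def interval_width(rate: float, total: int) -> float:
    """Width of the interval for a hypothetical pass rate on `total` tasks."""
    low, high = wilson_interval(round(rate * total), total)
    return high - low


if __name__ == "__main__":
    hypothetical_rate = 0.5  # assumption: worst case for binomial variance
    for name, n in BENCHMARK_SIZES.items():
        width = interval_width(hypothetical_rate, n)
        print(f"{name:>20}: n={n:4d}, 95% interval width ~ {100 * width:.1f} points")
```

Under these assumptions the interval on the smallest benchmark is several times wider than on the largest, so a gap of a few points between two models on a sub-100-task benchmark can sit entirely inside the noise.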