1d ago

Creator Teortaxes says frontier AI labs spend up to $15 billion each on highly inefficient training data

A browser-use dataset for SAP reportedly costs $500,000.

Original post

@xlr8harder you know it, i know it, everyone knows it

12:46 AM · May 30, 2026 View on X

@ChrisPainterYup @hamandcheese what do you mean? just reverse engineer and decompile steam games

Chris Painter @ChrisPainterYup

It's been really hard for METR to find vendors that can sell us long-horizon hard tasks that we can actually use.

5:59 PM · May 30, 2026 · 37K Views
1:22 AM · May 31, 2026 · 1.8K Views

It's been really hard for METR to find vendors that can sell us long-horizon hard tasks that we can actually use.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

> "Really good long horizon tasks go up to $20,000 each. A complete browser-use version of SAP was rumored at $500,000."

I think this points to market inefficiency

8:31 AM · May 30, 2026 · 64.9K Views
5:59 PM · May 30, 2026 · 37K Views

@ChrisPainterYup did you buy some?

6:06 PM · May 30, 2026 · 759 Views

@ChrisPainterYup What's the limiting factor

12:55 AM · May 31, 2026 · 1.3K Views

If this is true, then I really hope we have some better public evaluation benchmarks that are released as a result of this spending. Even top benchmarks for coding agents are usually super small. For example:
- DeepSWE has 113 tasks.
- TerminalBench-2.0 has 89 tasks.
- SWE-EVO has 48 tasks.
- SWE-Bench-Verified has 500 tasks.

Exceptions exist like SWE-Bench Pro / LiveCodeBench that have >1K tasks (although LiveCodeBench is usually evaluated over a small subset of <200 examples). However, the fact that important benchmarks are usually so small creates a lot of opportunity for noise and skews the measurement / interpretation of eval results. Very difficult to apply error bars or any other form of uncertainty estimation in this scenario.

Investing in better public benchmarks would likely be a tiny ratio of this spend but would have an outsized impact on the overall trajectory of coding agents.

9:14 PM · May 30, 2026 · 10.2K Views
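
To put rough numbers on the noise concern in the post above, here is a minimal sketch of how wide a 95% interval on a pass rate would be at the quoted benchmark sizes. It is an illustration, not anything from the thread: it assumes tasks are independent, a single evaluation run, a true pass rate of 0.5 (the worst case for variance), and the normal approximation to the binomial. The helper name `ci_half_width` is hypothetical.

```python
import math

# Task counts as quoted in the post above.
BENCHMARK_SIZES = {
    "SWE-EVO": 48,
    "TerminalBench-2.0": 89,
    "DeepSWE": 113,
    "SWE-Bench-Verified": 500,
}


def ci_half_width(n_tasks: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a pass rate.

    Assumes independent tasks and one run per task; pass_rate=0.5 is an
    assumed worst case, not a measured value.
    """
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


half_widths = {name: ci_half_width(n) for name, n in BENCHMARK_SIZES.items()}

for name, hw in half_widths.items():
    n = BENCHMARK_SIZES[name]
    print(f"{name:20s} n={n:4d}  +/- {100 * hw:.1f} points")
```

Under these assumptions the interval is roughly plus or minus 14 percentage points at 48 tasks and still about plus or minus 4 points at 500, so small score gaps between agents on the smaller benchmarks are hard to separate from sampling noise. Correlated tasks or run-to-run variance would make the real uncertainty larger than this sketch suggests.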