Creator Teortaxes says frontier AI labs spend up to $15 billion each on highly inefficient training data

A complete browser-use version of SAP is rumored to cost $500,000.


Original post

#1153 Florian Brand @XEOPHON

@xlr8harder you know it, i know it, everyone knows it

12:46 AM · May 30, 2026

Reposted by

#1488 @hamandcheese

QUOTE POST

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

> "Really good long horizon tasks go up to $20,000 each. A complete browser-use version of SAP was rumored at $500,000." I think this points to market inefficiency

8:31 AM · May 30, 2026 · 64.9K Views

#488 kache @YACINEMTB

@ChrisPainterYup @hamandcheese what do you mean? just reverse engineer and decompile steam games

Chris Painter @ChrisPainterYup

It's been really hard for METR to find vendors that can sell us long-horizon hard tasks that we can actually use.

5:59 PM · May 30, 2026 · 37K Views

1:22 AM · May 31, 2026 · 1.8K Views

QUOTE POST

#1073 Shital Shah @SYTELUS

Secret sauce of frontier models.

6:50 AM · May 31, 2026 · 211 Views

QUOTE POST

#1092 Chris Painter @ChrisPainterYup

It's been really hard for METR to find vendors that can sell us long-horizon hard tasks that we can actually use.

5:59 PM · May 30, 2026 · 37K Views

#1153 Florian Brand @XEOPHON

@ChrisPainterYup did you buy some?

6:06 PM · May 30, 2026 · 759 Views

#1220 rohit @KRISHNANROHIT

@ChrisPainterYup What's the limiting factor

12:55 AM · May 31, 2026 · 1.3K Views

QUOTE POST

#1444 Cameron R. Wolfe, Ph.D. @CWOLFERESEARCH

If this is true, then I really hope we have some better public evaluation benchmarks that are released as a result of this spending. Even top benchmarks for coding agents are usually super small. For example:
- DeepSWE has 113 tasks.
- TerminalBench-2.0 has 89 tasks.
- SWE-EVO has 48 tasks.
- SWE-Bench-Verified has 500 tasks.

Exceptions exist like SWE-Bench Pro / LiveCodeBench that have >1K tasks (although LiveCodeBench is usually evaluated over a small subset of <200 examples). However, the fact that important benchmarks are usually so small creates a lot of opportunity for noise and skews the measurement / interpretation of eval results. Very difficult to apply error bars or any other form of uncertainty estimation in this scenario.

Investing in better public benchmarks would likely be a tiny ratio of this spend but would have an outsized impact on the overall trajectory of coding agents.

9:14 PM · May 30, 2026 · 10.2K Views
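
To make the point about small benchmarks and noise concrete, here is a minimal sketch of how wide a naive error bar would be at the task counts quoted in the post above. It is not from the thread: it assumes every task is an independent pass/fail trial, uses a plain normal approximation, and picks a hypothetical 50% pass rate, which is the worst case for variance. Real benchmarks have correlated tasks and run-to-run variation, so this is only a rough lower bound on the uncertainty.

```python
import math

# Task counts quoted in the post above.
BENCHMARK_SIZES = {
    "SWE-EVO": 48,
    "TerminalBench-2.0": 89,
    "DeepSWE": 113,
    "SWE-Bench-Verified": 500,
}

# Hypothetical pass rate, chosen only for illustration (worst case for variance).
ASSUMED_PASS_RATE = 0.5


def ci_half_width(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a pass rate.

    Assumes n_tasks independent pass/fail trials (a simplifying assumption).
    """
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


for name, n_tasks in BENCHMARK_SIZES.items():
    half_width = ci_half_width(ASSUMED_PASS_RATE, n_tasks)
    print(f"{name:20s} n={n_tasks:4d}  +/- {100 * half_width:.1f} points")
```

Under these assumptions the interval is roughly 14 percentage points either side at 48 tasks and roughly 4 points at 500 tasks, so a difference of a few points between two agents on the smaller benchmarks sits well inside the noise.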
