/AI · 2h ago

Prime Intellect's Florian Brand warns AI agent benchmarks could saturate this year as models master generic software engineering tasks

METR evaluates specialized domains like machine learning and cybersecurity.

Original post

Florian Brand (@xeophon) · #1117 in AI

@scaling01 tasks are different, yes

Lisan al Gaib@scaling01

Cognition made a long time-horizon benchmark that should be good up to ~64 hours

Mythos had 16+ hour time horizons in April, so there's ~2 more doublings or ~210 days after Mythos until the benchmark saturates

meaning the benchmark is cooked before the end of the year

(unless the task distribution is different. we have already seen how time-horizons differ on different coding benchmarks)

1:18 PM · Jun 4, 2026 · 421 Views
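The arithmetic in the quoted post can be checked with a short sketch. It assumes a constant doubling time of about 105 days, which is only what the post's "~2 more doublings or ~210 days" implies, not a figure stated by any benchmark author. The function names and the April start dates are hypothetical, since the post gives only the month.

```python
import math
from datetime import date, timedelta


def doublings_to_saturation(current_hours: float, ceiling_hours: float) -> float:
    """Number of time-horizon doublings needed to reach the benchmark ceiling."""
    return math.log2(ceiling_hours / current_hours)


def saturation_date(start: date, current_hours: float,
                    ceiling_hours: float, doubling_days: float) -> date:
    """Projected saturation date, assuming a constant doubling time."""
    n = doublings_to_saturation(current_hours, ceiling_hours)
    return start + timedelta(days=n * doubling_days)


if __name__ == "__main__":
    # Figures from the post: 16-hour horizon in April, ~64-hour ceiling.
    # ASSUMPTION: 105-day doubling time (210 days / 2 doublings).
    # ASSUMPTION: exact day in April is unknown, so try both ends of the month.
    for start in (date(2026, 4, 1), date(2026, 4, 30)):
        print(start, "->", saturation_date(start, 16, 64, 105))
```

Under these assumptions both start dates land before the end of 2026, which matches the post's conclusion. A different task distribution would change the current horizon, and with it the projected date.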

Posts from X

Most Activity

Views: 206 · Likes: 3

Florian Brand@xeophon

@scaling01 METR dataset: ML eng, GPU kernels, cybersecurity

Cog dataset: real-life Java/TypeScript/Python/C# feature dev, bugfixes, migrations

From the QT

Florian Brand@xeophon

@scaling01 tasks are different, yes

Replies: 1

Lisan al Gaib@scaling01

@xeophon I guess generic SWE stuff is easier, so longer time horizons for these models, meaning earlier saturation?
