Apollo Research's Marius Hobbhahn warns that AI capabilities have outpaced safety evaluations, citing METR running out of long-horizon tasks

A reply to the post says an unnamed nonprofit has a $25 billion commitment to AI resilience.

Original post

Marius Hobbhahn @MariusHobbhahn

Unfortunately, I think the evals gap prediction came true.

Evals have made progress, but capabilities have made even more progress in the same time.

METR running out of long-horizon tasks is a good example for that.

Apollo Research @apolloaievals

The quality and quantity of evals required to make rigorous safety statements could outpace available evals. We explain “the evals gap” and what would be required to close it.

https://www.apolloresearch.ai/blog/evalsgap

4:37 AM · Jun 2, 2026 · 8.1K Views

Posts from X

Most activity: 570 views · 3 likes · 1 reply

⿻ Andrew Trask@iamtrask

@MariusHobbhahn @traintest_split If the evals stayed private from everyone, they’d last longer. Lab members can see the test and design RLHF/Mechanical Turk armies to swarm them.

David Manheim@davidmanheim

@MariusHobbhahn There's a nonprofit that has a $25b commitment to AI resilience, and a for-profit with some large donors coming up later this year - how quickly can you scale, and how would you expand the ecosystem if funding were not a barrier?
