Unfortunately, I think the evals gap prediction came true.
Evals have made progress, but capabilities have made even more progress over the same period.
METR running out of long-horizon tasks is a good example of that.
The quality and quantity of evals required to make rigorous safety statements could outpace available evals. We explain “the evals gap” and what would be required to close it.
https://www.apolloresearch.ai/blog/evalsgap
