Really fun to hang again with my friend @polynoamial (OpenAI research scientist, our first guest ever on @NoPriorsPod in early 2023) to talk about the implications of large test-time compute, and what happens when models are given $10M budgets to spend on a single task. Topics:
01:23 - Why Benchmarks Are Broken
04:19 - Compute Budgets and Projections
06:48 - How Long Should Models Think?
08:01 - Benchmarkmaxxing
09:48 - Noam's Evals
12:40 - Safety (When Model Capability Scales With Spend)
16:09 - Implications For the Model Release Cycle
18:34 - Latent Model Capability
22:27 - Limits on Recursive Self-Improvement
28:38 - Large-Scale Multi-Agent Coordination
30:39 - Competition at the Frontier
33:19 - Breaking the Benchmark Grid Equilibrium
34:57 - Why Benchmarks Should be Scaled by Cost