OpenAI Researcher Discusses Test-Time Compute Scaling and AI Safety Risks · Digg

Original post

sarah guo (@saranormous)

Really fun to hang again with my friend 🃏 @polynoamial (OpenAI research scientist, our first guest ever on @NoPriorsPod in early 2023) to talk about the implications of large test-time compute, and what happens when models are given $10M budgets to spend on a single task. Topics:

01:23 – Why Benchmarks Are Broken
04:19 – Compute Budgets and Projections
06:48 – How Long Should Models Think?
08:01 – Benchmarkmaxxing
09:48 – Noam's Evals
12:40 – Safety (When Model Capability Scales With Spend)
16:09 – Implications For the Model Release Cycle
18:34 – Latent Model Capability
22:27 – Limits on Recursive Self-Improvement
28:38 – Large-Scale Multi-Agent Coordination
30:39 – Competition at the Frontier
33:19 – Breaking the Benchmark Grid Equilibrium
34:57 – Why Benchmarks Should be Scaled by Cost

12:18 PM · Jun 26, 2026 · 8K Views

Related links

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

Via YouTube

Posts from X

sarah guo (@saranormous)

or watch 📺 / listen anywhere you get to podcast https://www.youtube.com/watch?v=AZrU6y3pUcU
