Tech · 1h ago

OpenAI frontier evaluations lead Tejal Patwardhan says existing AI benchmarks are increasingly saturated or exploited by modern models

Evaluations are critical for predicting next-generation model capabilities

Original post

OpenAI@OpenAI

Let’s talk about evals.

We’re always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed.

@tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be judged on next.

10:23 AM · Jun 16, 2026 · 72.6K Views

Sentiment

Many users praised OpenAI's push for evaluations beyond gameable benchmarks and welcomed the podcast on frontier model progress, while others disparaged current GPT performance and demanded the return of GPT-4o.

Positive: 50.0% · Negative: 50.0%

12 comments with sentiment.

Posts from X

Most Activity

Views 16.9K · Bookmarks 6 · Likes 36 · Replies 6

OpenAI@OpenAI

Listen to the OpenAI Podcast on—

Spotify https://open.spotify.com/episode/5pkjhUnZNSdibLKyMFdFa7?si=T9tynshWSV2puIhmDT7Axg

Apple https://podcasts.apple.com/us/podcast/why-tejal-patwardhan-stopped-underestimating-the/id1820330260?i=1000772992914

YouTube https://youtu.be/CFqjjKp9Y-Q

2h · Retweets 3

promptnprofit@promptnnprofit

@OpenAI @tejalpatwardhan @AndrewMayne Benchmarks are being gamed the moment they go public.

The real eval is always the same: did it solve my actual problem today?

Everything else is a leaderboard for researchers.

2h

Chris@ChrissGPT

This kinda seems like a weak gazelle begging the lion to convince it that it’s not tasty, just accept you got mogged hard on non gameable benchmarks as well. This constant “we shouldn’t worry about benchmarks” - because muh test time scaling is just a little cringe. You guys will win on benchmarks in the end of the year anyways

2h

芽LA芽la@xiaomin05127352

@OpenAI @tejalpatwardhan @AndrewMayne The real challenge is that benchmark scores don't always reflect real-world performance. Would love to see more eval focus on edge cases and actual user workflows.

1h

芽LA芽la@xiaomin05127352

@OpenAI The pace of AI capability growth has surprised even experts. Tejal's perspective on overcoming "underestimation bias" is crucial for anyone building in this space.

1h

David Stark@stark4833

@OpenAI @tejalpatwardhan @AndrewMayne No, let’s talk about 4o. #keep4o #4oForAll

2h

orcus108@orcus108

@OpenAI @tejalpatwardhan @AndrewMayne We got a 3B model matching Opus 4.5 in coding.

And it’s open source. Clearly there’s more scope for innovation on the model side.

2h

Symbioza2025 | ASA |@Symbioza2025

This is exactly the right direction.

As benchmarks become saturated or gamed, evaluations should move beyond single-output performance.

The harder question is long-horizon behavior:

does the model preserve intent, recover after correction, resist semantic drift, stay coherent across many turns, and avoid local correctness hiding global trajectory failure?

Future evaluations should measure not only what a model answers.

They should measure where the model’s reasoning trajectory is going.

1h

a c@ac2334357684130

@ChrissGPT @OpenAI @tejalpatwardhan @AndrewMayne Bro you didn't even have time to watch it before commenting

1h

Andrew Mayne@AndrewMayne

In the latest OpenAI Podcast the amazing @tejalpatwardhan talks about evals and what it's like to work at the frontier.

1h

EL@EL6488915246551

@OpenAI @tejalpatwardhan @AndrewMayne You only measure model progress based on criteria that are favorable to your speech. GPT-4o has a vast user base, complaining about the irreplaceable role of 4o in various scenarios cannot be continuously achieved in new models. #BringBack4o and #keep4o

1h

Furkan Gözükara@FurkanGozukara

@OpenAI @tejalpatwardhan @AndrewMayne I don't believe evals just publish next model so I can eval it

2h

MrMartin@MrmartinsGaze

@OpenAI @tejalpatwardhan @AndrewMayne When 5.6 and how good it will be. That's only thing we care and wait here

2h

Lisan al Gaib@scaling01

@OpenAI @tejalpatwardhan @AndrewMayne I want to see the AGI eval :(

1h

Chris@ChrissGPT

@intentaurixyz @ac2334357684130 @OpenAI @tejalpatwardhan @AndrewMayne It’s obvious they released this right after fable and this is the third time someone from OAI is telling us not to worry about test time scaling benchmarks lol

1h

沈希@ofnaN6MMcN57919

@OpenAI @tejalpatwardhan @AndrewMayne You know how GPT acts like a complete idiot these days?😂 I can't be bothered reading his responses anymore because they're all just flattery and pointless nonsense

45m

シエラ 🐾🍬@swe6340

@OpenAI @tejalpatwardhan @AndrewMayne Please implement the emotional intelligence & creativity that The 4-series had into 5.6 🙏🏼 were not all coders we need genuine depth, not a bot detached from human reality.

1h

Intentauri@intentaurixyz

@ac2334357684130 @ChrissGPT @OpenAI @tejalpatwardhan @AndrewMayne I doubt he did

1h

Intentauri@intentaurixyz

@ChrissGPT @OpenAI @tejalpatwardhan @AndrewMayne Wtf are you talking about bro. I thought you were smart

1h

3miry@3mireeee

@OpenAI @tejalpatwardhan @AndrewMayne You should run welfare related evals too

2h