Tech · 2h ago

Anthropic says Claude's success at choosing productive research next steps rose from 22% to 64% using a 2026 preview model

Dwarkesh Patel says this metric could train future RL agents.

Original post
Dwarkesh Patel @dwarkesh_sp · #67 in Tech

ht @Moh1tAgarwal for noticing this eval in the blog post and pointing it out as an RL target

Dwarkesh Patel @dwarkesh_sp

Re the Fable ML sandbagging, the model's AI research capabilities were probably at least partly trained on Anthropic employees diffing atop proprietary algos and infra.

So the IP leak is somewhat like a researcher who knows Anthropic's stack getting poached to another lab.

Anthropic's recent "When AI builds itself" post talks about a next-step eval: they snapshot a research session at the moment a human researcher made a suboptimal next-step choice, show a model only the transcript up to that point and ask what it would do next, then have a hindsight-equipped LLM judge decide whether the model's suggestion or the human's actual choice was better.

This eval seems like a very good RL target for AI R&D - one among many that could be used to have AIs emulate Anthropic researchers and their research products.

I'm just speculating. But if this was a motivation, then Anthropic should have figured out a better way to protect IP than sandbagging without telling the user they're sandbagging, which is very hostile and untrustworthy behavior.

3:21 PM · Jun 10, 2026 · 5.2K Views
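
The next-step eval described in the post above can be sketched as a simple win-rate loop. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the `Snapshot` schema, the `propose` and `judge` callables, and the toy stand-ins below are all hypothetical names introduced here. In a real setup `propose` would call the model under test and `judge` would call an LLM that also sees what happened after the decision point.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Snapshot:
    """One research session frozen at a decision point (hypothetical schema)."""
    transcript_prefix: str  # everything the model under test is allowed to see
    human_next_step: str    # what the researcher actually did next
    hindsight: str          # what happened afterwards; shown only to the judge


# propose: transcript prefix -> suggested next step (the model under test).
Proposer = Callable[[str], str]
# judge: (snapshot, suggestion) -> True if the suggestion beats the human's choice.
Judge = Callable[[Snapshot, str], bool]


def next_step_win_rate(
    snapshots: Sequence[Snapshot], propose: Proposer, judge: Judge
) -> float:
    """Fraction of snapshots where the judge prefers the model's suggestion."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    wins = 0
    for snap in snapshots:
        # The proposer sees only the prefix, never the hindsight.
        suggestion = propose(snap.transcript_prefix)
        if judge(snap, suggestion):
            wins += 1
    return wins / len(snapshots)


# Toy stand-ins so the sketch runs without any model calls (assumptions).
def toy_propose(transcript_prefix: str) -> str:
    return "check the data loader"


def toy_judge(snap: Snapshot, suggestion: str) -> bool:
    # Crude proxy for a hindsight judge: did the suggestion name the real fix?
    return suggestion.lower() in snap.hindsight.lower()


TOY_SNAPSHOTS = [
    Snapshot(
        transcript_prefix="Loss plateaus after step 2k.",
        human_next_step="sweep the learning rate",
        hindsight="Root cause: shuffling bug. Fix was to check the data loader.",
    ),
    Snapshot(
        transcript_prefix="Eval accuracy dropped after the refactor.",
        human_next_step="retrain from scratch",
        hindsight="Root cause: wrong tokenizer version.",
    ),
]

toy_rate = next_step_win_rate(TOY_SNAPSHOTS, toy_propose, toy_judge)
```

Used as an RL target, as the post speculates, the per-snapshot judge verdict would play the role of the reward signal; that mapping is an assumption of this sketch rather than something the source confirms.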
Posts from X

This is also a very good dataset for extracting a "Claude now be retarded!" steering vector. RL the model to choose good suggestions by default. As for the vector, well… Truly, frontiers in alignment. Didn't even take a lot of mech interp. @Turntrout96, @gleech winning

(Quoting the post by Dwarkesh Patel @dwarkesh_sp shown above.)

1h · Views 1.5K · Likes 8 · Bookmarks 3
Pete Skomoroch @peteskomoroch

@martin_casado @dwarkesh_sp spells this out more clearly:

(Quoting the post by Dwarkesh Patel @dwarkesh_sp shown above.)

1h · Views 101 · Likes 0 · Bookmarks 0
Winston B. @DoDataThings

@teortaxesTex @Turntrout96 @gleech Future alignment timelines will have this slide. 'We had to RL the model to not call itself stupid.'

54m · Views 1