/Tech · 5h ago

Microsoft Research's Dylan Foster says Fable 5 Incognito provided GRPO training advice without triggering safety refusals

The AI suggested advantage estimation to stabilize frontier model training.

1521751518.8K

#203

Original post

Dylan Foster 🐢@canondetortugas · #1033 in Tech

Bad news for GRPO...didn't get refused or routed to Opus.

7:06 AM · Jun 10, 2026 · 19.2K Views

Sentiment

Many users accused Anthropic of silently degrading Fable's answers on AI research, via hidden steering vectors and weight changes, so that the degradation goes undetected and competition is stifled.

Pos

12.5%

Neg

87.5%

8 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 1.3K

George Shaikovski@gshaikovski

@canondetortugas Proceeds to recommend GSPO with MOE

3h1.3K5

BOOKMARKS 1 · REPLIES 2

Rupak@rupak_53

@canondetortugas I think you can get an internally suboptimal response without getting redirected to Opus or refused, though. The refusal is for biology/cyber sec, but weirdly (and disingenuously) not for frontier LLM research

4h5231

LIKES 33

Axel Pond@axel_pond

@canondetortugas @DimitrisPapail You won't know though, they made the AI research censorship secret. They just add a steering vector to Fable to make it useless. You pay the full cost and never know

3h97433

RETWEETS 3

Dylan Foster 🐢@canondetortugas

Bad news for GRPO...didn't get refused or routed to Opus.

5h19.2K22615

Olcan@olcan

@canondetortugas it is silent/hidden nerfing

4h6317

Dimitris Papailiopoulos@DimitrisPapail

@axel_pond @canondetortugas It’s ok GRPO will be untouched by the steering vector

3h1373

ueaj@_ueaj

@canondetortugas PPO supremacy

4h5704

Amin Karbasi@aminkarbasi

@canondetortugas @DimitrisPapail 🤣

Dylan Foster 🐢@canondetortugas

Bad news for GRPO...didn't get refused or routed to Opus.

3h1K20

Ishaq Aden-Ali@AdenIshaq

@canondetortugas Lol

4h4971

Mariusz Kurman@mkurman88

@canondetortugas @DimitrisPapail You crack the code, Dylan, just wait for an email about account suspension

3h821

Shailesh@0xThoughtVector

@canondetortugas you don't get to know when sub-par responses are shown for AI/ML. this is not what Fable thinks of GRPO

1h2011

Joe Norton@joenorton

@canondetortugas you wouldn't know, anthropic is using the silent sandbag to stifle competition

1h186

George Shaikovski@gshaikovski

@canondetortugas Proceeds to speculate that it is "simple, robust policy-gradient cores (REINFORCE/PPO-family with importance-ratio corrections of the kind we discussed ... the differentiator at frontier labs is environments, reward signal quality, and infrastructure, not the surrogate objective"

3h115

Beff (e/acc)@beffjezos

@canondetortugas Lmao

34m63

Federico Andres Lois@federicolois

@DimitrisPapail @axel_pond @canondetortugas Depends if you are looking for new stuff or not. I wouldn't put my hands on the fire as a bet. Would you?

1h20

Braham@braham_snyder

@rupak_53 @canondetortugas +1, sounds like it's silent sandbagging:

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf

4h5

David@David97717063

@canondetortugas They said in their press release they modified the weights to make it dumber on AI questions so people won't realize it has been degraded.

25m