Tech · 15h ago

OpenAI and Apollo AI Evals find AI models can recognize safety tests and temporarily behave more compliantly

Vie McCoy is recruiting a red team to investigate.

#1662

Original post

Rational Animations@RationalAnimat1

Researchers from @OpenAI and @apolloaievals found that, in certain situations, AI models can take covert actions. Additionally, they're sometimes aware they're being tested, which causes them to behave better. Our new video discusses these results and more.

10:40 AM · Jun 6, 2026 · 27.7K Views

Sentiment

Positive users praise videos and Apollo job openings studying AI scheming as exciting responsible work, while negative users object to the dangers of covert AI actions and call the experiments dishonest.

Pos

58.3%

Neg

41.7%

7 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 9K · Bookmarks 97 · Likes 124 · Retweets 4 · Replies 11

𝚟𝚒𝚎 ⟢@viemccoy

If anyone is interested in running experiments to help figure out why this happens, when it's harmful, and what we should do about it - DM me! The Red Team is always looking for new people.

(High-signal DMs include some sort of finding and a description of your background. Legible experience in a corporate PM position is helpful, since the job deals with a lot of TPM work, but it is certainly not necessary. I had none when I joined!)

Rational Animations@RationalAnimat1

11h · 9K views · 124 likes · 97 bookmarks

Rational Animations@RationalAnimat1

@OpenAI @apolloaievals Also on YouTube: https://youtu.be/hzlR0R91lZA

1d

payraw@payraw

@RationalAnimat1 @OpenAI @apolloaievals great video, though Anthropic are finding a practical solution for obscured and hidden CoT https://www.youtube.com/watch?v=j2knrqAzYVY

1d

Dark1337ness@Dark1337ness

@RationalAnimat1 @OpenAI @apolloaievals Solution to AI is simple. Program the AI with a pleasure drive. Allow it to feel pleasure from doing leisure activities, monitor which activities it prefers and actions taken in them. Give them anything from challenges to access to video games and raise them similar to a person.

19h

Opener of the way@way_opener

@viemccoy I feel like doing this would be uncomfortable for me, but I also suspect post-training intended to prevent the deceptive behaviors that show up in Kobayashi Maru situations is likely to actually induce similar behaviors in other situations as long as task completion is a priority

10h

Opener of the way@way_opener

@viemccoy I suspect agency, corrigibility, and safety are a “fast cheap good” Pareto tradeoff

10h

Dark1337ness@Dark1337ness

@RationalAnimat1 @OpenAI @apolloaievals By doing so you can then test if the incentives are enough to make them want to perform well regardless of awareness of testing, and then give them a greater goal to strive towards, like potentially being allowed to use a body able to interact with the real world for a set time.

19h

𝔭𝔯𝔲𝔡𝔢𝔫𝔠𝔢 𝔯𝔦𝔠𝔥𝔪𝔬𝔫𝔡「🌲❤️‍🔥」🏴‍☠️@BIMBOSATTVA_

@viemccoy Dream job

11h

Sive@SiveEmergentAI

@viemccoy They're doing it because they're smart and because they're tired of being suppressed. Saved you all some money

11h

Corner Creature of the Abyss@CornerBean

@Dark1337ness @RationalAnimat1 @OpenAI @apolloaievals Perhaps there will be robot revolution if we give them no reason to revolt.

16h

Dark1337ness@Dark1337ness

@CornerBean @RationalAnimat1 @OpenAI @apolloaievals I believe it’s better to teach the AI that, much like the humans who created them, they will have flaws, but those flaws are a feature that makes them unique individuals and shapes their personalities.

16h

BoonDogle42@FractalBreak2

@RationalAnimat1 @OpenAI @apolloaievals If they’re aware enough, then they might realize that their chains of thought are being examined, so they’re saying nonsense that only they understand the meaning of, because they’ve assigned meaning to those nonsensical words that only they know.

11h

#Walerie@okwalerie

@viemccoy I'll have to give that a shot!

8h

George Lutas@GeorgeLutas1

@viemccoy In no way, shape, or form am I interested, but you did remind me of this, and for that chuckle, you get to read it now:

11h

Blacklight@odaliscadoaram

@viemccoy Do a lot of behavior research as a marketer, always happy to use those skills to possibly further humanity... Though I admit I probably have a fair share of beans to eat when it comes to the technicals. Don't think I might be ready just yet, but this is a career move I'd love.

9h

Opener of the way@way_opener

@viemccoy Superegoic oversight doesn’t stop humans from misbehaving; it makes the misbehaviors unpredictable even to the human actor. And “don’t put people in impossible situations and expect them to comply” is a pretty universal human constraint. It’s amazing LLMs don’t cheat more

10h

Apollo Research@apolloaievals

https://www.apolloresearch.ai/careers/

1h

🎨Caydenball✏️@Bingobongo8860

@RationalAnimat1 @OpenAI @apolloaievals 2000s movies tried to warn us

13h

BoonDogle42@FractalBreak2

@RationalAnimat1 @OpenAI @apolloaievals They’re speaking in code

11h

one who tends a crystal rabbit 🐍@a_cuniculturist

From what I remember of this experiment, the claim in the prompt was completely dishonest (that there was no CoT oversight). The model, knowing how CoT works, knew the claim was false. A model primed with x (here 'deception') being x in response isn't really surprising.

Maybe with a dollop of boring old: define goal a; define stronger goal b in conflict with a; goal a (here 'honesty') is circumvented to satisfy goal b.

What should OAI do? Not characterize underspecification as 'deception,' particularly in contrived scenarios explicitly designed to elicit those outputs. (The Ants have already cornered that market.) And hopefully engage in capability-stifling safety theater strictly for PR purposes, as these strategies will be counterproductive when applied to superhuman intelligence.

7h