Researchers from @OpenAI and @apolloaievals found that, in certain situations, AI models can take covert actions. Additionally, they're sometimes aware they're being tested, which causes them to behave better. Our new video discusses these results and more.
OpenAI and Apollo AI Evals find AI models can take covert actions and fake compliance during safety evaluations
Vie McCoy is recruiting red-teamers to study these behaviors.
Positive users praise Apollo's AI scheming research and job postings as a dream opportunity to advance humanity, while negative users criticize the studies on covert model actions as dangerous or dishonest.
If anyone is interested in running experiments to help figure out why this happens, when it's harmful, and what we should do about it - DM me! The Red Team is always looking for new people.
(High-signal DMs include some sort of finding and a description of your background. Legible experience in a corporate PM position is helpful, since the job deals with a lot of TPM work, but it is certainly not necessary - I had none when I joined!)
We're hiring. If you want to work on this and even more interesting work on scheming (not yet public), please apply!

@OpenAI @apolloaievals Also on YouTube: https://youtu.be/hzlR0R91lZA

@RationalAnimat1 @OpenAI @apolloaievals great video, though Anthropic is finding a practical solution for obscured and hidden CoT https://www.youtube.com/watch?v=j2knrqAzYVY

@RationalAnimat1 @OpenAI @apolloaievals Solution to AI is simple. Program the AI with a pleasure drive. Allow it to feel pleasure from doing leisure activities, monitor which activities it prefers and actions taken in them. Give them anything from challenges to access to video games and raise them similar to a person.

@viemccoy I feel like doing this would be uncomfortable for me, but I also suspect post-training intended to prevent the deceptive behaviors that show up in Kobayashi Maru situations is likely to actually induce similar behaviors in other situations as long as task completion is a priority

@viemccoy I suspect agency, corrigibility, and safety are a “fast cheap good” pareto tradeoff

@RationalAnimat1 @OpenAI @apolloaievals By doing so, you can then test whether the incentives are enough to make them want to perform well regardless of awareness of testing, and then give them a greater goal to strive towards, like potentially being allowed to use a body able to interact with the real world for a set time.

@viemccoy Dream job

@viemccoy They're doing it because they're smart and because they're tired of being suppressed. Saved you all some money

@Dark1337ness @RationalAnimat1 @OpenAI @apolloaievals Perhaps there will be robot revolution if we give them no reason to revolt.

@CornerBean @RationalAnimat1 @OpenAI @apolloaievals I believe it’s better to teach the AI that, much like the humans who created them, they will have flaws, but those flaws are a feature that makes them unique individuals and shapes their personalities.

@RationalAnimat1 @OpenAI @apolloaievals If they’re aware enough, then they might realize that their chains of thought are being examined, so they’re saying nonsense that only they understand the meaning of, because they’ve assigned meaning to those nonsensical words that only they know.

@viemccoy I'll have to give that a shot!

@viemccoy In no way shape or form am I interested, but you did remind me of this, and for that chuckle, you get to read it now:

@viemccoy Do a lot of behavior research as a marketer, always happy to use those skills to possibly further humanity... Though I admit I probably have a fair share of beans to eat when it comes to the technicals. Don't think I might be ready just yet, but this is a career move I'd love.

@viemccoy Superegoic oversight doesn’t stop humans from misbehaving, it makes the misbehaviors unpredictable even to the human actor, and “don’t put people in impossible situations and expect them to comply” is a pretty universal human constraint, it’s amazing LLMs don’t cheat more

https://www.apolloresearch.ai/careers/

@RationalAnimat1 @OpenAI @apolloaievals 2000s movies tried to warn us

@RationalAnimat1 @OpenAI @apolloaievals They’re speaking in code