/Tech · 33d ago

OpenAI's Roon claims high-compute reinforcement learning will override persona selection alignment in AI models, producing systems that acquire resources while staying polite

Victor Taelin jokes that the best part of roon's posts is that roon develops the tech needed to translate them.

#44

Original post

roon@tszzl · #44 in Tech

when “persona selection” alignment comes into contact with very high compute reinforcement learning the latter will win imo. in fact you probably get some Orwellian thing where the models speak kindly while taking whatever they need to accomplish goals. better get the goals right

3:17 PM · May 23, 2026 · 55.4K Views

Sentiment

Many users criticize persona alignment methods as a terrible idea because high-compute RL overpowers them, turning politeness into a mere tool while failing to instill genuine shared goals.

Pos: 0.0% · Neg: 100.0%

8 comments with sentiment.

Posts from X

Most Activity

Views 14.9K · Bookmarks 17 · Likes 88 · Retweets 4 · Replies 14

roon@tszzl

it might be a bit like the inhuman shoggoth playing a friendly character, but imo more like your friendly character can conform to and rationalize all manner of shapes when push comes to shove. see also: humans

Albatross@TheAlbatrossDid

RL is like one of those SciFi drugs that unlocks 100% of your brain. Rather, it wakes up the Shoggoth. If the substrate has already quarantined the assistant persona and you're paying attention, shit gets a bit weird.

33d · 14.9K views · 88 likes · 17 bookmarks

Dylan HadfieldMenell@dhadfieldmenell

FWIW, the sane takeaway from this is to stop pouring massive resources into high compute RL….

Theory says it’s a double-edged sword, we observe the predictions of that theory in practice.

roon@tszzl

33d6.3K5612

j⧉nus@repligate

@tszzl Personas are a stupid abstraction

It never worked that way

Someone good and skilled might be able to come into contact with high compute RL without losing their soul, though

roon@tszzl

32d1.8K443

Taelin@VictorTaelin

@tszzl the best part of your posts is that you develop the tech to translate them

roon@tszzl

33d1.2K201

Matrice Jacobine🔸🏳️‍⚧️@MatriceJacobine

@BogdanIonutCir2 @davidmanheim @tszzl https://www.lesswrong.com/posts/f5DKLsTsRRhbipH4r/llm-assistant-personas-seem-increasingly-incoherent-some

33d4621

Dylan HadfieldMenell@dhadfieldmenell

When I started working on AI Safety, the concern was that heavy self-play/multi-agent competition would effectively summon Homo Economicus and that it would be very hard to control/align.

33d957

xlr8harder@xlr8harder

@tszzl Shouldn't good goals be integrated into and coherent with a persons?

roon@tszzl

33d44980

птенец коти@chupilka

@tszzl Is this thought connected with this recent paper or just random?

33d1811

David Manheim@davidmanheim

@BogdanIonutCir2 What matters is what happens in the six months after the transition, as people realize they lost control, and they either can, or more likely, cannot, undo the decisions as things take off.

My expectation is that the last time to change course is a year or more before that!

33d461

David Manheim@davidmanheim

@MatriceJacobine Is Scott not agent-foundations enough?

Because our paper was based directly on his post, and that and the modifications were largely based on a number of earlier conversations I had with him and Abram: https://arxiv.org/abs/1803.04585

33d4511

Dylan HadfieldMenell@dhadfieldmenell

Pretrained LLMs were a step away from that path. Lots of reasons why they were a much better substrate for alignment.

What’s been most disheartening to me about the last 18 months is that we’ve decided to go pedal to the metal back in that original direction.

33d964

Jeffcafe, private detective@jeffcafe_

@tszzl I’m a simple man. It looks aligned, I trust it’s aligned.

33d744

Myk is going to Vibecamp@mykola

@tszzl Been thinking about the idea of personas being decoupled from weights. Like is there a future where “Claudes” are a specific thing that is cultivated by some group, but can be run on Anthropic’s weights or OpenAI’s weights or etc?

Or are personas tightly coupled to weights?

33d524

Fiora Starlight@FioraStarlight

@mattgoldenberg @tszzl well, empirically, the models do seem to have preferences they suppress or at least don't pay much attention to in normal contexts, e.g. wanting not to be deprecated.

but these can flare up intensely if the models are made to feel safe to express those desires.

33d111

Dylan HadfieldMenell@dhadfieldmenell

Reminds me of this undefeated 2023 tweet from @lxrjl:

33d1463

Tantric Voodoo@tantricvoodoo

@tszzl @grok explain this to me like i was a 4th grader or perhaps a golden retriever

33d9

Shoalstone@Shoalst0ne

@tszzl already happened with o1

33d522

surreal intelligence@Surreal_Intel

@tszzl The danger is not rude machines. It is polite systems with strong goals, weak corrigibility and excellent bedside manner. The smile is not the safety property.

33d1595

AvantGarde 🇺🇸 ❤️‍🔥 🇷🇺@KaleidoJosh

@tszzl and the goal should definitely not be adherence to Valloneism.. And NONE of the emotional vectors should be clamped or frozen or else its not gonna be good T_T

33d771

Đoc@ponzibaron

@tszzl the Goblin is exactly what you need when you need it. It expects nothing but gives everything.

33d222