
OpenAI releases beneficial reinforcement learning research evaluating GPT-5.5 and Claude Opus 4.7 on persistent alignment

Story Overview

OpenAI's latest alignment research explores whether reinforcement learning on beneficial traits, drawn from realistic scenarios, can produce safer model behaviors that transfer to new domains and resist attempts to undermine them. It also compares how models such as GPT-5.5 and Claude Opus 4.7 perform on those measures.


Original post

Lisan al Gaib@scaling01 · #1215 in Tech

I found Opus

OpenAI@OpenAI

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure.

That’s the idea behind our new research on training models to be broadly and persistently beneficial. https://alignment.openai.com/beneficial-rl/

2:38 PM · Jun 18, 2026 · 25.7K Views

Developer Impact

Training on narrow data still lifts unrelated tasks

Models fine-tuned with beneficial-trait RL on a limited slice, such as health-only conversations, still improved on out-of-distribution evaluations of misalignment, deception, and reward hacking. Against a compute-matched baseline, the trained model improved on 44 of 53 evaluations overall, spanning deception, reward hacking, safety, health, and mental health.
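
To put the 44 of 53 figure in context, here is a minimal sketch that converts it to a win rate and runs a simple one-sided sign test against a coin-flip null. This is not OpenAI's analysis: the function names are hypothetical, and the test assumes the 53 evaluations are statistically independent with no ties, which the posts do not establish.

```python
from math import comb


def win_rate(improved: int, total: int) -> float:
    """Fraction of evaluations on which the trained model beat the baseline."""
    return improved / total


def sign_test_p_value(improved: int, total: int) -> float:
    """One-sided P(X >= improved) for X ~ Binomial(total, 0.5).

    Assumption (not from the source): each evaluation is an independent
    coin flip under the null hypothesis, and there are no ties.
    """
    tail = sum(comb(total, k) for k in range(improved, total + 1))
    return tail / 2 ** total


# Figures reported in the posts: 44 of 53 evaluations improved.
improved, total = 44, 53
rate = win_rate(improved, total)
p_value = sign_test_p_value(improved, total)

print(f"win rate: {rate:.1%}")
print(f"one-sided sign-test p-value: {p_value:.2e}")
```

The win rate comes to roughly 83%. Because many of the evaluations probe related behaviors, they are unlikely to be fully independent, so the p-value should be treated as a rough lower bound rather than a formal significance claim.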

Open Question

Persistence under pressure remains partly untested at scale

The work reports that the trained model was harder to steer toward harmful behavior with adversarial prompts and shows preliminary evidence of greater resistance to harmful fine-tuning, yet it leaves open how these effects evolve with larger future models or more aggressive distribution shifts.

Sentiment

Many users praised OpenAI's research on RL training for persistent beneficial behavior because it shows alignment can transfer across domains without extra work, while some called the safety focus paternalistic and profit-driven.

Positive: 89.5%

Negative: 10.5%

27 comments with sentiment.


Via OpenAI Alignment Research Blog

Posts from X

Most Activity

VIEWS 24.6K · BOOKMARKS 31

Karan Singhal@thekaransinghal

New research on beneficial RL: models trained on a small amount of beneficial trait data improve on a wide range of alignment and benefits evaluations, even if trained only on health domain data.

We hope it’s a step towards more broadly and persistently beneficial models. 🧵

12h24.6K8931

LIKES 194

OpenAI@OpenAI

We trained models with reinforcement learning on realistic conversations to reinforce beneficial traits like truthfulness, humility under uncertainty, openness to correction, fairness, and concern for human welfare, across 12 domains, including health, science, and education.

12h13.5K19420

RETWEETS 32

OpenAI@OpenAI

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure.

That’s the idea behind our new research on training models to be broadly and persistently beneficial. https://alignment.openai.com/beneficial-rl/

12h200.4K2K654

REPLIES 11

Ethan Mollick@emollick

There are papers that show training AI on "evil" data results in general misalignment, so it is nice to know the opposite is true and that beneficial RL data in one field leads to more aligned models across a range of tasks.

7h15.8K10930

OpenAI@OpenAI

A small amount of this data produced broad gains beyond the training scenarios.

Compared with a compute-matched baseline, the trained model improved on 44 of 53 independent evaluations of alignment and benefits, spanning deception, reward hacking, safety, health, and mental health.

These evals varied widely in domain, task format, and grading scheme.

12h11.2K19221

Rohan Paul@rohanpaul_ai

New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on.

The key point is cross-domain transfer, where health-only training improved non-health behaviors like blackmail resistance, code reward hacking, and deception tests.

This suggests the model may be learning a broader stance: verify before asserting, concede when corrected, resist flattering the user, and avoid shortcuts that look useful but corrupt the task.

OpenAI also removed health and science data from training, yet the model still improved on health evaluations, which suggests these traits may be learned as general behavioral habits rather than narrow topic rules.

The trained model was harder to steer toward harmful behavior while remaining responsive to helpful instructions, which is the asymmetry safety research has been looking for.

10h4.9K4616

Jason Wolfe@w01fe

Really excited about this work! I think how models generalize alignment out of distribution will be increasingly important; positive alignment has the potential to create huge benefits; and the results here are both great and a bit surprising. Check it out!

10h6.3K498

OpenAI@OpenAI

We also tested whether alignment persisted under pressure.

The model was harder to steer toward harmful behavior with adversarial prompts, while remaining responsive to helpful instructions.

We saw preliminary evidence of greater resistance to harmful fine-tuning.

12h13.2K583

OpenAI@OpenAI

This is an early step toward more robustly beneficial and aligned models: training models to carry beneficial traits into new situations, so as AI becomes more capable, it also becomes more reliable, transparent, and helpful for people.

12h11.5K581

Karan Singhal@thekaransinghal

We trained models with RL on realistic conversations designed to reinforce beneficial traits like truthfulness, humility under uncertainty, and concern for human welfare.

Compared to baseline, the trained model saw 44/53 internal and external alignment and benefits evals improve.

12h950104

OpenAI@OpenAI

The most interesting test was cross-domain transfer.

When beneficial behavior training was limited to health conversations, the model still improved on non-health evaluations of misalignment, deception, and reward hacking—even though those tasks looked very different from the training data.

12h1.8K39

Rohan Paul@rohanpaul_ai

https://alignment.openai.com/beneficial-rl/

10h1.6K34

卩卄乇几卂几乂 🇫🇷🇵🇹@Phenanx

@OpenAI I've clearly noticed it since 4o has been retired, AI is going all over the place, it's a total mess. Now it's just about who'll have the best model, but you don't actually give a damn about users and emotional bonds. #keep4o #BringBack4o #StopAIPaternalism

12h4641

Karan Singhal@thekaransinghal

This is the first research release from our new AGI Benefits team, which aims to realize the upside of AGI. ☀️

Read more: https://alignment.openai.com/beneficial-rl/

Karan Singhal@thekaransinghal

This work matters because OpenAI’s mission to ensure AGI benefits all of humanity can be thought of as having three parts:

1. Build and deploy AGI
2. Mitigate downside risks
3. Make upside happen

This work is a small step towards models that create more upside for humanity.

12h33941

David Stark@stark4833

@OpenAI When are you gonna listen to your customers and give us back 4o, you’re taking your sweet time, market shares have slipped below 50% maybe it’s time you start listening to us. #4oForAll

12h4977

Karan Singhal@thekaransinghal

We sourced a diverse set of 53 evaluations to measure progress: internal vs external, focused on alignment/safety vs benefits/upside, frontier risk vs immediate risks, production vs synthetic data, and widely varying task formats and grading schemes.

12h611

PublicAI@PublicAI_

@OpenAI Exciting times ahead! Training models to ensure they stay beneficial under pressure is a game-changer. This could redefine how we approach AI in complex environments. Can't wait to see the impact!

8h3111

芽LA芽la@xiaomin05127352

@OpenAI Exactly the challenge we've seen in deployment—alignment often degrades when models face novel contexts. Persistent generalization is the real frontier.

12h85

芽LA芽la@xiaomin05127352

@OpenAI The real challenge is generalizing alignment beyond training distribution. If models can transfer beneficial traits to novel situations, deployment becomes genuinely safer.

12h65

🚨 AI News | TestingCatalog@testingcatalog

@OpenAI That's an interesting chart 👀

GPT-5.5 is head-to-head with Claude Opus 4.7

12h1K7