Tech · 13h ago

Richard Ngo argues that focusing on directly solving the alignment problem is a memetic trap

It distracts researchers from studying fundamental intelligence and agency.


#363

Original post

Richard Ngo @RichardMCNgo · #363 in Tech

The idea of “solving” the alignment problem is a powerful memetic trap.

It created a field that spends too much time thinking “is this a solution?” to generate the kind of insights (about intelligence and agency) that solutions require.

8:28 AM · Jun 30, 2026 · 687 Views

Sentiment

Many users praised the piece's critique of AI alignment fixation as a thought-provoking perspective on research priorities, while others rejected it as absurd and overly dismissive of core challenges.

Positive: 72.7%

Negative: 27.3%

11 comments with sentiment.


Posts from X


Views: 4K · Bookmarks: 38 · Likes: 100 · Retweets: 2 · Replies: 5

Richard Ngo@RichardMCNgo

Notable exceptions: MIRI’s old agent foundations team, Steve Byrnes, Chris Olah, Davidad (though the latter three still partly fall into the trap).

Notable non-exceptions: Hinton, Bengio, Sutskever. At one point I expected that recruiting the very best ML researchers would help a lot (which is why I ran the original SF Alignment Workshop with Ilya).

But then they basically speedran the mistakes EA made in engaging with alignment: failing to deeply understand the problem, focusing on political games, founding a new lab, and the trap in my parent tweet.

That tweet was specifically inspired by Bengio’s new paper on Scientist AI. I’ve only skimmed it, but I’m pretty confident that it’s a central example of this trap. There are deep conceptual questions about the relationship between prediction and agency, what it means to have a goal, etc. This paper glosses over almost all of them in its attempt to have something it can pitch as a solution. See for instance its definition of a dangerous predictor below: there’s no insight there, just an attempt to get *something* you can prove *some* theorems about.

13h · 4K views · 100 likes · 38 bookmarks

Richard Ngo@RichardMCNgo

To steelman what Yoshua is doing with this paper: he might be trying to create a Schelling fence (cf. https://www.lesswrong.com/posts/Kbm6QnJv9dgWsPHQP/schelling-fences-on-slippery-slopes) for the limited kinds of AIs that should be allowed given a ban, in order to make a ban on other kinds of AI more appealing.

I don’t think this is a crazy strategy. I worry about it polluting the epistemic commons, though. He’s a prominent enough scientist that it would be reasonable for people getting into the field to assume that his papers are mainly trying to do science not politics. But perhaps we should just set a field-wide norm that if you release an actual paper (as a PDF and everything) we assume you’re mainly doing politics. This would be a costly norm, to be clear! But perhaps less costly than young smart principled people bouncing off the field because they think it’s all evals and guardrails and political maneuvering.

Anyway, here’s the full paper in case you want to check it out:

13h3.8K325

Richard Ngo@RichardMCNgo

I’m not claiming that Yoshua doesn’t believe in it! In fact, “really believes this can work” is closely related to the failure mode I’m calling out in the first tweet. That is, when you’re trying to find solutions, you end up thinking on a shallower level.

A result of thinking on that shallower level is then that you focus less on scientific questions like “can I disprove my hypothesis?” or “how principled are these definitions” or “are there weird edge cases I should pay attention to?” and more on “how can I spread my ideas in a way that will help the world?” Which is the thing that I think is going on in this paper, which I’m classifying as politics.

12h1.5K112

Richard Ngo@RichardMCNgo

@gleech oh also one problem with scientist ai is “what if someone builds a non-scientist ai”. Hence a ban on other ai seems like it must be an important part of Yoshua’s theory of impact

12h19481

Leo Gao@nabla_theta

@RichardMCNgo I've engaged with Bengio's ideas a moderate amount and i think there's some interesting stuff there. i don't know if it works, i have my doubts, but i think it's high quality thinking.

7h15921

aishwarya🍎@aishdoingthings

@RichardMCNgo richard i told u wicked problems are infinite games and the idea is to develop adaptive immune responses for them

11h2389

gavin leech (Non-Reasoning)@gleech

@RichardMCNgo or indeed what if someone uses your Q as an agent. Still, worth exploring non-shallowly and it will be

11h8911

Peacock Angel of History@LapsusLima

@RichardMCNgo @sebkrier Yes.

9h30721

Richard Ngo@RichardMCNgo

Broadly speaking, science is when you’re aiming at falsifiable insights, and doing things like I list below.

What I call politics is when you’re primarily aiming at persuading people.

So it’s about intentions—but those intentions are often very visible if you look for them.

9h1.2K8

gavin leech (Non-Reasoning)@gleech

@RichardMCNgo Thanks Richard. I can instead attest that Bengio really believes this can work, and that this kind of 4D chess (deeply ironic given the constraints the paper seeks to place on AIs!) didn't come up during the research, within my limited earshot of it.

12h1565

Andrew Rose@__drewface

@RichardMCNgo Taking this thread and turning it into a permanent essay that is something like "trying to solve a problem gets in the way of defining that problem (and thus doing science)" (title needs so much work, sorry)

Would be very nice if you ever feel like it's worthwhile.

8h14511

Richard Ngo@RichardMCNgo

@nabla_theta cool, any pointers to where there’s the densest concentration of his high quality thinking?

Specifically I am interested in nontrivial insights he’s had about the nature of a non-goal-directed AI (rather than arguments about why it’d be great if we could achieve that).

6h11211

definitely a person@silencenbetween

@RichardMCNgo def agree there’s a lot of premature formalization, but I mostly think of this as downstream of the normal pressures in academia. seems plausible it’s esp tempting here tho, given how difficult the aim is

3h51

fellow ⚚ traveler ❤️‍🔥@architectonyx

@RichardMCNgo Do you think any approach to alignment essentially entails a solution to the origin of life problem? Namely, origin of life is partly about how "syntactic" configurations of molecules develop agency, semantics, values, etc so plausibly an explanation of how these emerge connects

12h1521

gavin leech (Non-Reasoning)@gleech

@RichardMCNgo disclaimer: I wasn't involved in the details at all and don't speak for them

12h1421

gavin leech (Non-Reasoning)@gleech

@RichardMCNgo On a skim, do you view the paper as overclaiming? I view the project as more of a retreat from the failure mode you describe in OP: not a grandiose general solution to value alignment but an attempt to bound one kind of danger

12h1261

Thŏth@thoth_iv

I was under the impression that "solving" the alignment problem came from MIRI.

Or do you mean something more subtle than that? - perhaps the idea that you can find a solution "easily" - perhaps the idea that there are valid notions of "good enough" - perhaps the idea that there are methods we should try right now anyway, even though we don't know if it is even good enough?

10h1211

For All Humanity@TheAiPrize

@RichardMCNgo Deep thought 💯 Right now there’s no single consensus in the AI safety community – maybe the main trap is the lack of coherence in views and opinions?

11h865

gavin leech (Non-Reasoning)@gleech

@ultima_shifl @RichardMCNgo Pardon me for saying vacuously: I can think of one way

11h361

singularvessel@ultima_shifl

@gleech @RichardMCNgo Isn't the point that it's only half an agent? It tells you what the consequences of decisions will be, so to get an agent you need to add a second layer that decides if you like those consequences or not. And all the alignment difficulty lives in that second layer.

11h34