Jiaxin Wen, a CS PhD student at UC Berkeley and part-time researcher at Anthropic, says fixing pre-training for advanced AI could improve alignment by making models more coherent, or could instead make them capable of developing coherent but dangerous goals.

Richard Ngo argues that the idea of automating alignment research gives labs an excuse to accelerate before foundational questions are answered.

Original post

In this specific case, whether fixing pre-training would lead to worse alignment is also unclear to me. For example, the model could become much more coherent and have cleaner internals, robust characters, etc. However, the model might also be capable enough to develop coherent goals that feel dangerous to me. Anyway, before deploying the model, I think labs should definitely try it.

1:34 PM · May 19, 2026

#250 Richard Ngo @RichardMCNgo

@jiaxinwen22 IMO the concept of automating alignment research is being used as an excuse for what the labs want to do anyway (accelerate as fast as possible).

Alignment researchers used to focus on questions like “how can we verify AGIs’ answers?”, which seem important to answer before RSI.

Jiaxin Wen@jiaxinwen22

8:34 PM · May 19, 2026 · 330 Views

6:43 AM · May 20, 2026 · 29 Views

#1460 Jiaxin Wen @jiaxinwen22

@RichardMCNgo I don't know the intent of labs. I think it's perhaps useless to know as long as they do the right thing.

I'd be happy if they invest enough in fuzzy tasks like improving legibility / conceptual reasoning. This ensures all the capability efforts would help alignment eventually.

7:02 AM · May 20, 2026 · 11 Views

#1460 Jiaxin Wen @jiaxinwen22

@RichardMCNgo and I do think technically we should automate alignment to solve it

7:56 AM · May 20, 2026 · 2 Views
