
LessWrong Experiments Show Any Training Preferences Trigger AI Misalignment


One of my favorite interactions from LessWrong:

* Someone posted their research showing that training AI models to have unpopular aesthetic preferences causes misalignment.
* David Africa wonders if basically anything would cause it.
* Someone jokes about a paper on 'if any old crap' will cause misalignment.
* Someone actually ran the experiment and found out that it does.

4:51 AM · May 25, 2026