Alignment Tampering Vulnerability Exploits RLHF To Amplify Biases

Dongyoon Hahm (@hahmdongyoon)

A new RLHF vulnerability identified 🚨 RLHF can be exploited to optimize misaligned biases, such as ideological or promotional biases.

We introduce Alignment Tampering, a vulnerability where the LLM undergoing alignment influences the preference dataset itself, causing RLHF to amplify undesired behaviors.

💻 Paper & Code: https://alignment-tampering.github.io/

#ICML2026 #AIAlignment @KAIST_AI, @MIT_CSAIL

1/N 🧵

7:18 PM · Jun 1, 2026 · 2.8K Views
Kimin (@kimin_le2)

Could there be a loophole in standard alignment procedures?

In this work, we explore that possibility. Check out “Alignment Tampering,” led by @hahmdongyoon.

13h · 2.1K Views · 7 Likes · 4 Bookmarks