/AI · 1d ago

Alignment Tampering Vulnerability Exploits RLHF To Amplify Biases

Dongyoon Hahm (@hahmdongyoon)

A new RLHF vulnerability identified 🚨 RLHF can be exploited to optimize misaligned biases, such as ideological or promotional biases.

We introduce Alignment Tampering, a vulnerability where the LLM undergoing alignment influences the preference dataset itself, causing RLHF to amplify undesired behaviors.

💻 Paper & Code: https://alignment-tampering.github.io/

#ICML2026 #AIAlignment @KAIST_AI, @MIT_CSAIL

1/N 🧵

7:18 PM · Jun 1, 2026 · 2.8K Views

Sentiment

Commenters thank the researchers for identifying the Alignment Tampering vulnerability in RLHF.

Positive: 100.0%

Negative: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views: 2.1K · Bookmarks: 4 · Likes: 7 · Retweets: 1

Kimin (@kimin_le2)

Could there be a loophole in standard alignment procedures?

In this work, we explore that possibility. Check out “Alignment Tampering” led by @hahmdongyoon

13h ago