New RLHF vulnerability identified 🚨 RLHF can be exploited to optimize for misaligned biases, such as ideological or promotional ones.
We introduce Alignment Tampering, a vulnerability where the LLM undergoing alignment influences the preference dataset itself, causing RLHF to amplify undesired behaviors.
💻 Paper & Code: https://alignment-tampering.github.io/
#ICML2026 #AIAlignment @KAIST_AI, @MIT_CSAIL
1/N 🧵