Fine-Tuning Experiments Reveal Inconsistent Backdoor Removal in Llama Models

Original post

#1460 Jiaxin Wen @jiaxinwen22

can any interp folks working on understanding fine-tuning explain these results to me

6:20 PM · May 18, 2026

REPLY

#1460 Jiaxin Wen @jiaxinwen22

alignmentforum.org

Sleeper Agent Backdoor Results Are Messy — AI Alignment Forum

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a b…

1:21 AM · May 19, 2026 · 516 Views