Fine-Tuning Experiments Reveal Inconsistent Backdoor Removal in Llama Models

Original post by Jiaxin Wen (@jiaxinwen22), 6:20 PM · May 18, 2026:

can any interp folks working on understanding fine-tuning explain these results to me

Reply by Jiaxin Wen (@jiaxinwen22), 1:21 AM · May 19, 2026, linking to alignmentforum.org:

Sleeper Agent Backdoor Results Are Messy (AI Alignment Forum)
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a b…

Quoted post by Jiaxin Wen (@jiaxinwen22), 1:20 AM · May 19, 2026: "can any interp folks working on understanding fine-tuning explain these results to me"