10h ago

Study Shows NLAs Fail to Interpret Steered Activations Faithfully

Original post
Daniel Khashabi 🕊️ (@DANIELKHASHABI) | @NOAHCHREIN

This is interesting. I definitely think of prompting as steering in activation space, but I took for granted that I could always come up with some prompt, perhaps a complex one, to steer activations however I wanted. Guess I was wrong!

7:13 PM · May 18, 2026
Reposted by
Daniel Khashabi 🕊️ (@DANIELKHASHABI)
