One of the biggest challenges in knowledge erasure for LLMs is that erasure methods typically leave traces of the target knowledge, which make it easy to recover through relearning.
In new work, we show that such traces can often be found in model embeddings, which existing methods largely leave untouched.
We find that removing these traces dramatically reduces susceptibility to relearning, while also improving erasure precision!
Check out @ClaraSuslik's thread for details. Paper and demo are out!
New Preprint📢
Removing knowledge from LLMs is hard. Preventing models from relearning it is even harder.
In our new paper with @megamor2 and @OrShafran, we show that existing erasure methods have a blind spot: token embeddings.
The solution? EMBER🔥
🧵👇