Some cool work that I co-mentored with @NeelNanda5
I recommend the appendix section on practical AO evaluation details.
In particular, consensus sampling significantly reduces hallucinations, and eval performance improves substantially as more activations are collected.
New research from @japhba and me!
Activation Oracles are a pretty cool interpretability tool. They answer natural-language questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?
Turns out: Yes! We identify four fixes that make AOs substantially more useful!