New research from @japhba and me!
Activation Oracles are a pretty cool interpretability tool. They answer natural-language questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?
Turns out: Yes! We identify four fixes that make AOs substantially more useful!

