
Researchers Identify Four Fixes to Improve Activation Oracles


Original post

Owain Evans#301

Adam Karvonen@a_karvonen

Some cool work that I co-mentored with @NeelNanda5

I recommend the appendix section on practical AO evaluation details.

In particular, consensus sampling significantly reduces hallucinations, and eval performance majorly improves with more collected activations.

Celeste (in bay, dm)@celestepoasts

New research from @japhba and me!

Activation Oracles are a pretty cool interpretability tool. They answer natural questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?

Turns out: Yes! We identify four fixes that make AOs substantially more useful!

11:51 AM · Jun 4, 2026 · 4.2K Views

