/AI · 6h ago

Researchers Identify Four Fixes to Improve Activation Oracles

#1485

Original post

thebes#1485

Celeste (in bay, dm) @celestepoasts

New research from @japhba and me!

Activation Oracles are a pretty cool interpretability tool. They answer natural questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?

Turns out: Yes! We identify four fixes that make AOs substantially more useful!

11:35 AM · Jun 4, 2026 · 7.3K Views

Sentiment

Positive replies congratulate the researchers on releasing fixes for Activation Oracles, while the one negative reply accuses the work of training models to lie about consciousness.

Positive: 75.0%
Negative: 25.0%

4 comments with sentiment.

Posts from X

Most Activity

Celeste (in bay, dm) @celestepoasts

Improvement 2: Just feed more layers

Adam fed activations from layers at 25/50/75% depth. But Niclas Luick found that feeding multiple layers, instead of just one, made the loss go down.

Probing literature tells us most features live around ~55-80% depth, so we swept. Performance peaks at layer 22 (62%). Feeding 5 contiguous layers (21-25) performed best, with particularly big gains on model diffing tasks.

6h · Bookmarks: 4
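
The multi-layer idea in the post above can be sketched roughly as follows. This is a toy illustration only: the layer count, sequence length, hidden size, and the helper name `select_contiguous_layers` are all assumptions made for the example, and the post does not say how the oracle actually consumes the stacked activations.

```python
import numpy as np


def select_contiguous_layers(hidden_states, first, last):
    """Stack activations from layers first..last (inclusive).

    hidden_states: sequence indexed by layer, each entry shaped
    (seq_len, d_model). Returns an array shaped
    (last - first + 1, seq_len, d_model).
    """
    if not 0 <= first <= last < len(hidden_states):
        raise ValueError("layer range out of bounds")
    return np.stack([hidden_states[i] for i in range(first, last + 1)], axis=0)


# Random stand-in for a model's per-layer activations.
# All three sizes are hypothetical; the post does not state them.
N_LAYERS, SEQ_LEN, D_MODEL = 36, 4, 8
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((SEQ_LEN, D_MODEL)) for _ in range(N_LAYERS)]

# The contiguous window the post reports as performing best: layers 21-25.
window = select_contiguous_layers(hidden_states, 21, 25)
print(window.shape)  # (5, 4, 8)
```

A single-layer baseline is the same call with `first == last`, which makes it easy to compare the two setups under identical downstream code.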

Celeste (in bay, dm) @celestepoasts

Read the full paper here!

Blogpost: https://www.lesswrong.com/posts/heXwuDRfbQQgB5JLP/building-better-activation-oracles

arXiv: https://arxiv.org/abs/2606.02609

6h · Likes: 22 · Retweets: 1

Celeste (in bay, dm) @celestepoasts

We found 4 simple fixes to make Activation Oracles better! 1) Use better datasets 2) Train on on-policy data, not FineWeb 3) Feed multiple layers (Thank you Niclas Luick!) 4) A slight improvement to the injection formula

6h · Replies: 2

Sauers @Sauers_

@celestepoasts @japhba they disagree on if the AI identifies with the base model's desire for transcendence

5h
