15h ago

Celeste and Jan introduce four training modifications raising Activation Oracle interpretability scores on AObench from 0.25 to 0.43

The modifications reduce vagueness and hallucinations in activation analysis.

Sentiment

Positive: 66.7%

Negative: 33.3%

Positive commenters congratulate the researchers on identifying four fixes that improve Activation Oracles, while negative commenters accuse the work of training models to lie about consciousness.

3 comments with sentiment.
