15h ago

Celeste and Jan introduce four training modifications raising Activation Oracle interpretability scores on AObench from 0.25 to 0.43

The modifications reduce vagueness and hallucinations in activation analysis.

Sentiment: 66.7% positive, 33.3% negative (3 comments)

Positive users congratulate the researchers on identifying four fixes to improve activation oracles, while negative users accuse the work of training models to lie about consciousness.