here's a new paper (co-authored with @andy_q_han and @Pavel_Izmailov) on an apparent "functional welfare axis" in the activation space of language models. this axis seems to track how well a system is achieving its (quasi-)goals, and it steers welfare-related behaviors.
in models trained with RL on a maze task, the axis tracks reward. more surprisingly, even prior to RL, the axis seems to track and steer functional welfare in a related way, and it is later recruited by RL to serve as a reward axis.
this phenomenon is of technical interest in understanding RL, and it's also of philosophical interest. functional welfare is not the sort of full-blown welfare, involving consciousness and mental states, which confers moral status. it's defined in terms of how well a system is meeting its quasi-goals, and quasi-goals are defined in terms of behavior (roughly, a system has X as a quasi-goal if it behaves as if it has X as a goal).
nevertheless, it may well be that functional welfare is one aspect of full-blown welfare, and the existence of a functional welfare axis raises philosophically interesting questions about whether there could be an axis for full-blown welfare in more advanced AI systems.
i should say that i am very much a minor co-author on this piece, which is spearheaded by the amazing @andy_q_han, a first-year computer science ph.d. student at NYU and an anthropic fellow, with guidance from @Pavel_Izmailov, computer science prof at NYU, formerly at openAI and now part-time at anthropic. i came on board mostly to help with the philosophical interpretation of the results.
i don't know for sure that the functional welfare hypothesis is correct (especially where base models are concerned), and other interpretations are available (e.g. that it's a confidence axis), but the axis is fascinating in any case and i think it will repay study.
all the details can be found at http://functionalwelfare.com or at https://arxiv.org/abs/2605.30232.