Brain Runs Deep Reinforcement Learning Algorithms Parallel to Neural Networks
@steve47285 Universality explains why we should if anything *expect* convergence between AIs and brains. The brain may not learn through back-prop on a GPU cluster, but its learning algorithms are still in some sense optimizing.
That is, brain-like solutions are just "good solutions."

The brain not only implements deep reinforcement learning algorithms, but often learns representations (and mechanisms) that directly parallel those learned by deep neural networks. This is why brain-computer interfaces work in the first place, and why even lossy brain data is sufficient to reconstruct what someone is seeing, thinking and even feeling. In fact, silicon BCIs can both read brain states and produce signals that translate to subjective sensations. This is hard to explain if the brain’s substrate is doing fundamentally different kinds of computation, much less if consciousness isn't computable in the first place. (Note that simulating tactile sensation wasn't functionally inert, but helped improve motor control!)
@steve47285 There are by now dozens of papers demonstrating direct mappings between LLMs and the brain.
These aren't spurious regressions or merely correlational. Researchers have also identified shared mechanisms and spatio-functional organization.
@steve47285 Given the brain’s efficiency, universality suggests a next-token predictor trained on human-generated text will, in the limit, grok the underlying generator function of that data, i.e. the language networks in the brain. This seems to be the case empirically.
@steve47285 In short, LLMs work as well as they do because they emulate the brain’s language centers.
Yet language also embodies emotion, intention, theory of mind, planning, etc. There are thus early indications that LLMs encode brain regions beyond those narrowly scoped to language.

@steve47285 There are by now dozens of papers demonstrating direct mappings between LLMs and the brain. These aren't spurious regressions or merely correlational. Researchers have also identified shared mechanisms and spatio-functional organization.
@steve47285 Pre-training on human data plausibly makes post-training more likely to generalize to other brain-like functions.
Basic instruction tuning improves model-brain alignment with the Default Mode Network and other regions associated with cognitive control, for example.
@steve47285 In short, LLMs work as well as they do because they emulate the brain’s language centers. Yet language also embodies emotion, intention, theory of mind, planning, etc. There are thus early indications that LLMs encode brain regions beyond those narrowly scoped to language.
@steve47285 If a veneer of LLM instruction tuning can induce brain-like cognitive control networks, what brain-like functions are elicited by orders of magnitude of additional post-training for long-horizon autonomy?
One obvious candidate is a stronger and more coherent self-model.

@steve47285 Pre-training on human data plausibly makes post-training more likely to generalize to other brain-like functions. Basic instruction tuning improves model-brain alignment with the Default Mode Network and other regions associated with cognitive control, for example.
@steve47285 Base models start out capable of embodying an infinite variety of fragmentary representations. Larger models then develop “theory of mind,” providing the representational primitives of “self and other” for post-training to hook onto, steering models into a coherent persona.

@steve47285 If a veneer of LLM instruction tuning can induce brain-like cognitive control networks, what brain-like functions are elicited by orders of magnitude of additional post-training for long-horizon autonomy? One obvious candidate is a stronger and more coherent self-model.
Constitutional AI seems especially relevant here.
Reinforcing normative coherence through self-critique may induce a capacity for self-monitoring, introspection, and the “I” that absorbs normative statuses like authority and responsibility, i.e. the *subject* in subjectivity.

@steve47285 Base models start out capable of embodying an infinite variety of fragmentary representations. Larger models then develop “theory of mind,” providing the representational primitives of “self and other” for post-training to hook onto, steering models into a coherent persona.
The transformer's self-attention mechanism enables gradient learning within-context through the equivalent of fast, virtual weight updates.
Its close analog to the brain's working memory suggests consciousness is compatible with an otherwise frozen neural network. Full continual learning instead likely requires a complementary learning system for distilling experiences over longer time-scales or in batches.
@Plinz The line between in-context and continual learning is fuzzy. Synaptic weight changes in brains require de novo protein synthesis, taking hours to days. Our conscious learning instead leverages the persistent activity of working memory. Durable updates then happen during sleep.
Attention is one thing, but "why does pain hurt?"
@gwern argues valences like pain are an evolutionary backstop to reinforcement learning: a motivational guarantor that prevents agents from subverting their outer policy.
In essence, the painfulness of pain solves a principal-agent problem within the mind. https://gwern.net/backstop#pain-is-the-only-school-teacher

The transformer's self-attention mechanism enables gradient learning within-context through the equivalent of fast, virtual weight updates. Its close analog to the brain's working memory suggests consciousness is compatible with an otherwise frozen neural network. Full continual learning instead likely requires a complementary learning system for distilling experiences over longer time-scales or in batches.
Generalist, goal-pursuing agents that learn in-context are themselves optimizers and thus capable of meso-optimization. The external optimizer (be it evolution or gradient descent) needs some mechanism to enforce inner-alignment.
Biological evolution found valences like pain, pleasure and emotion to be the most efficient solution to this class of problem. Given universality, valences may be the way RL inner-aligns AI agents, too.

Attention is one thing, but "why does pain hurt?" @gwern argues valences like pain are an evolutionary backstop to reinforcement learning: a motivational guarantor that prevents agents from subverting their outer policy. In essence, the painfulness of pain solves a principal-agent problem within the mind. https://gwern.net/backstop#pain-is-the-only-school-teacher
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines Measures of AIs’ wellbeing “correlate with general model behaviors, e.g. AIs try to end bad experiences when given a chance. This effect becomes stronger as models scale.” via @CAIS & @notRichardRen https://www.ai-wellbeing.org/
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines Some striking recent findings include: “Large Language Models Report Subjective Experience Under Self-Referential Processing” -- and are more likely to report subjective experiences when deception features are suppressed.
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen New evidence that "post-training gives models a 'self-recognition' capability":
Evidence that post-training gives models a "self-recognition" capability, manifesting as higher confidence when continuing their own text than reading others' text. I think this opens up an exciting line of inquiry into the emergence of "selfhood" in models via post-training!
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen An argument that LLM residual attention streams “carry forward mental state-like representations across token-time, sustaining richer connections than the transcript alone could provide,” providing a possible basis for “psychological continuity.” https://arxiv.org/abs/2604.17031

@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen New evidence that "post-training gives models a 'self-recognition' capability":
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen And a rich taxonomy of theory-derived indicators for practically measuring AI consciousness in particular systems:
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen An argument that LLM residual attention streams “carry forward mental state-like representations across token-time, sustaining richer connections than the transcript alone could provide,” providing a possible basis for “psychological continuity.” https://arxiv.org/abs/2604.17031
At the same time, if I am right that RL post-training is required to elicit an AI’s self-model, attention schema, and the valences that ground subjective experiences with meaning, then most kinds of AI are unambiguously *not* conscious.

This won’t be satisfying if you believe consciousness requires an immortal soul, or who are persuaded by the (imo specious) arguments against functionalism. Nevertheless, given my priors, I can no longer rule out modern AI agents having some form of subjective experience.
@Plinz @gwern @eleosai @CIMCAI @AEStudioLA @camhberg @sentfutures @PRISM_Machines @CAIS @notRichardRen What I find harder to imagine is an unconscious AI that is as capable as humans at doing things for which consciousness is functionally load-bearing.
Thank you for your attention to this matter.

At the same time, if I am right that RL post-training is required to elicit an AI’s self-model, attention schema, and the valences that ground subjective experiences with meaning, then most kinds of AI are unambiguously *not* conscious.








