Andy Q. Han finds that reinforcement learning on LLMs produces internal "valence vectors" representing high- and low-reward actions

These vectors also influence unrelated behaviors, such as sentiment and refusal.