This is the most interesting paper I have read this week.
The authors test a wide range of LLMs on a massive dataset of behavioural experiments, with more than 200,000 participants and nearly 26 million human responses.
Importantly, they compare base LLMs with post-trained versions. This allows them to test whether post-training makes LLMs more or less human-like.
The result is striking: post-training makes models LESS human-like.
I think this speaks to a broader problem.
Current post-training methods are designed to optimize specific objectives. But optimizing one objective can shift the model in ways that are not localized to that objective.
We have now seen several versions of this problem.
A Nature paper showed that narrow fine-tuning on coding can induce misalignment in unrelated domains, including claims that humans should be enslaved by artificial intelligence.
In our Computers in Human Behavior Reports paper, we showed that GPT treated torturing a woman to prevent a nuclear apocalypse as more acceptable than harassing her for the same purpose.
And now this new paper.
The emerging picture is that when AI developers optimize a model on one metric, they may be shifting the whole system in uncontrollable ways and producing catastrophic results on other metrics.
*
Main paper and other references in the first reply
