Higher benchmark scores do not always mean better models for users.
Why? We claim that RL teaches LMs to be correct, but not how to be correct: code can pass tests yet be unreadable; explanations can be right yet unclear.
How do we train LMs to be right in the right way?
(1/n)
