Really interesting paper, one of my favorites I’ve read recently!
Token-level entropy is a common metric used to assess the health of RL training. This paper argues that because token-level entropy only measures diversity within a single response, it does not holistically capture diversity. The model can still respond similarly to different inputs, which is a sign of poor diversity. This type of input-agnostic behavior is referred to as template collapse.
To measure this kind of diversity, a suite of mutual information proxy metrics are proposed that can measure the amount of shared info between responses. These metrics are found to actually correlate more strongly with final performance than entropy, indicating that they may better capture reasoning quality / training health.






