/AI7h ago

RAGEN-2 Paper Proposes MI Metrics to Detect Template Collapse in RL

10161141408.2K
Original post
Cameron R. Wolfe, Ph.D.@cwolferesearch#1467inAI

Really interesting paper, one of my favorites I’ve read recently!

Token-level entropy is a common metric used to assess the health of RL training. This paper argues that because token-level entropy only measures diversity within a single response, it does not holistically capture diversity. The model can still respond similarly to different inputs, which is a sign of poor diversity. This type of input-agnostic behavior is referred to as template collapse.

To measure this kind of diversity, a suite of mutual information proxy metrics are proposed that can measure the amount of shared info between responses. These metrics are found to actually correlate more strongly with final performance than entropy, indicating that they may better capture reasoning quality / training health.

1:53 PM · Jun 6, 2026 · 7.7K Views
Sentiment

Users are validating the RAGEN-2 paper because its metrics for spotting template collapse match the production problem of models giving near-identical answers despite apparent per-response diversity.

Pos
100.0%
Neg
0.0%
1 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Most Activity
VIEWS537BOOKMARKS3LIKES4

sorry! link to paper is here: https://arxiv.org/abs/2604.06268

Really interesting paper, one of my favorites I’ve read recently!

Token-level entropy is a common metric used to assess the health of RL training. This paper argues that because token-level entropy only measures diversity within a single response, it does not holistically capture diversity. The model can still respond similarly to different inputs, which is a sign of poor diversity. This type of input-agnostic behavior is referred to as template collapse.

To measure this kind of diversity, a suite of mutual information proxy metrics are proposed that can measure the amount of shared info between responses. These metrics are found to actually correlate more strongly with final performance than entropy, indicating that they may better capture reasoning quality / training health.

4hViews 537Likes 4Bookmarks 3
Harris7@KrishWiller

@cwolferesearch https://arxiv.org/abs/2604.06268

5hViews 44Bookmarks 1
Blissy@BlissyOnX

@cwolferesearch token level is cool but what about agent rollup entropy for full multi-turn episodes

feels like that matters more in their setting

7hViews 79
Guilherme O'Tina@guilhermeotina

the snr filtering via reward variance is neat, but there's a bootstrapping tension: if the model is already collapsed, reward variance is low, so the filter avoids the prompts that could break the collapse. feels like you'd need deliberate high-variance injections as a reset mechanism

6hViews 69
Strata@ChainZenit

@cwolferesearch Standard RL metrics usually fail to capture the full picture anyway.

7hViews 60
Rugbist@rugbist_

@cwolferesearch token-level entropy only captures breadth inside one response

mutual info across responses seems like the real tell

7hViews 52
Shuying Luo@shuying_luo

@cwolferesearch It would be interesting to see cross turn measures as well. In long term rollouts, later reasoning is probably less related to the initial prompt than the state

5hViews 34
TecAce@tecaceai

@cwolferesearch Template collapse matches what we hit in production — per-response diversity looked fine while the model gave near-identical answers across very different inputs. Mutual info across responses is the signal we proxied by hand. Does it hold online, not just offline eval?

5hViews 2