RAGEN-2 Paper Proposes MI Metrics to Detect Template Collapse in RL

Original post

Cameron R. Wolfe, Ph.D. @cwolferesearch · #1467 in AI

Really interesting paper, one of my favorites I’ve read recently!

Token-level entropy is a common metric used to assess the health of RL training. This paper argues that because token-level entropy only measures diversity within a single response, it does not holistically capture diversity. The model can still respond similarly to different inputs, which is a sign of poor diversity. This type of input-agnostic behavior is referred to as template collapse.

To measure this kind of diversity, the paper proposes a suite of mutual information proxy metrics that measure the amount of shared info between responses. These metrics are found to correlate more strongly with final performance than entropy does, indicating that they may better capture reasoning quality / training health.

1:53 PM · Jun 6, 2026 · 7.7K Views
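To make the contrast in the post concrete, here is a minimal toy sketch. It is not the paper's metric suite: it assumes responses can be bucketed into discrete labels and uses a simple plug-in estimate of the mutual information between prompts and responses, I(X; Y) = H(Y) - H(Y | X). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a Counter of occurrences."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def conditional_entropy(pairs):
    """H(Y | X): average per-prompt response entropy, weighted by prompt frequency."""
    by_prompt = defaultdict(Counter)
    for prompt, response in pairs:
        by_prompt[prompt][response] += 1
    n = len(pairs)
    return sum((sum(c.values()) / n) * entropy(c) for c in by_prompt.values())


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) = H(Y) - H(Y | X), in bits."""
    marginal = Counter(response for _, response in pairs)
    return entropy(marginal) - conditional_entropy(pairs)


# Toy "collapsed" policy (assumption): every prompt gets the same mix of responses.
collapsed = [(p, r) for p in ("p1", "p2") for r in ("a", "b")]
# Toy "healthy" policy (assumption): the response depends on the prompt.
healthy = [("p1", "a"), ("p1", "a"), ("p2", "b"), ("p2", "b")]

print(conditional_entropy(collapsed), mutual_information(collapsed))
print(conditional_entropy(healthy), mutual_information(healthy))
```

In this toy setup the collapsed policy has the higher per-prompt entropy yet zero mutual information with the prompt, which is the kind of input-agnostic behavior the post describes as invisible to a per-response diversity measure.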

Sentiment

Users are validating the RAGEN-2 paper because its metrics for spotting template collapse match the production problem of models giving near-identical answers despite apparent per-response diversity.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 537 · Bookmarks: 3 · Likes: 4

Cameron R. Wolfe, Ph.D. @cwolferesearch

sorry! link to paper is here: https://arxiv.org/abs/2604.06268

4h · 537 views · 4 likes · 3 bookmarks

Harris7 @KrishWiller

@cwolferesearch https://arxiv.org/abs/2604.06268

5h · 441

Blissy @BlissyOnX

@cwolferesearch token level is cool but what about agent rollout entropy for full multi-turn episodes?

feels like that matters more in their setting

7h · 79

Guilherme O'Tina @guilhermeotina

the snr filtering via reward variance is neat, but there's a bootstrapping tension: if the model is already collapsed, reward variance is low, so the filter avoids the prompts that could break the collapse. feels like you'd need deliberate high-variance injections as a reset mechanism

6h · 69
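The bootstrapping tension raised in this reply can be illustrated with a small sketch. It is not the paper's filtering procedure: it assumes a filter that keeps a prompt only when the population variance of its sampled rewards clears a threshold. The function name, the threshold, and the reward values are all hypothetical.

```python
from statistics import pvariance


def filter_by_reward_variance(rewards_by_prompt, min_var=0.01):
    """Keep prompts whose sampled rewards vary enough to carry a learning signal."""
    return [
        prompt
        for prompt, rewards in rewards_by_prompt.items()
        if pvariance(rewards) >= min_var
    ]


# Hypothetical rollouts before collapse: the "mixed" prompt sometimes succeeds.
before = {"easy": [1, 1, 1, 1], "mixed": [0, 1, 0, 1], "hard": [0, 0, 0, 0]}
# Hypothetical rollouts after collapse: every prompt gets the same outcome each time.
after = {"easy": [1, 1, 1, 1], "mixed": [0, 0, 0, 0], "hard": [0, 0, 0, 0]}

print(filter_by_reward_variance(before))
print(filter_by_reward_variance(after))
```

Under these assumptions the filter keeps only the "mixed" prompt before collapse and keeps nothing afterward, so a variance-only criterion would need some other source of exploration to recover, which is the point the reply makes.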

Strata @ChainZenit

@cwolferesearch Standard RL metrics usually fail to capture the full picture anyway.

7h · 60

Rugbist @rugbist_

@cwolferesearch token-level entropy only captures breadth inside one response

mutual info across responses seems like the real tell

7h · 52

Shuying Luo @shuying_luo

@cwolferesearch It would be interesting to see cross-turn measures as well. In long-term rollouts, later reasoning is probably less related to the initial prompt than to the state

5h · 34

TecAce @tecaceai

@cwolferesearch Template collapse matches what we hit in production — per-response diversity looked fine while the model gave near-identical answers across very different inputs. Mutual info across responses is the signal we proxied by hand. Does it hold online, not just offline eval?

5h · 2