
sorry! link to paper is here: https://arxiv.org/abs/2604.06268
Users confirm the RAGEN-2 paper's template collapse observations match their production experiences where models gave near-identical answers despite fine diversity metrics.

sorry! link to paper is here: https://arxiv.org/abs/2604.06268

@cwolferesearch https://arxiv.org/abs/2604.06268

@cwolferesearch token level is cool but what about agent rollup entropy for full multi-turn episodes
feels like that matters more in their setting

the snr filtering via reward variance is neat, but there's a bootstrapping tension: if the model is already collapsed, reward variance is low, so the filter avoids the prompts that could break the collapse. feels like you'd need deliberate high-variance injections as a reset mechanism

@cwolferesearch Standard RL metrics usually fail to capture the full picture anyway.

@cwolferesearch token-level entropy only captures breadth inside one response
mutual info across responses seems like the real tell

@cwolferesearch It would be interesting to see cross turn measures as well. In long term rollouts, later reasoning is probably less related to the initial prompt than the state

@cwolferesearch Template collapse matches what we hit in production — per-response diversity looked fine while the model gave near-identical answers across very different inputs. Mutual info across responses is the signal we proxied by hand. Does it hold online, not just offline eval?