𝐋𝐨𝐧𝐠𝐞𝐫 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 ≠ 𝐛𝐞𝐭𝐭𝐞𝐫 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠. 𝐓𝐡𝐢𝐬 𝐛𝐫𝐞𝐚𝐤𝐬 𝐭𝐡𝐞 𝐬𝐜𝐚𝐥𝐢𝐧𝐠 𝐥𝐚𝐰. 𝐂𝐚𝐧 𝐰𝐞 𝐟𝐢𝐱 𝐢𝐭?
For a given problem, CoT accuracy follows an inverted-U: it rises, peaks, then falls as the chain grows longer. Harder problems push the peak rightward, but the cliff is always there. Test-time scaling has a ceiling that few talk about.
So we asked: why does extra thinking hurt?
We measured pre-softmax attention from answer tokens back to the critical insights buried earlier in the chain: the small subset of sentences that actually determine the final answer. The decay is monotonic with distance. The longer the model reasons, the less access it has to the very conclusions that matter most. It's reasoning with a fading memory of its own best ideas.
This is the same problem sequence models have always faced. LSTMs solved it with an explicit memory cell that persists and updates as the sequence unfolds. The fix for long CoT should look the same.
𝐓𝐡𝐚𝐭'𝐬 𝐰𝐡𝐚𝐭 𝐰𝐞 𝐛𝐮𝐢𝐥𝐭. 𝐖𝐞 𝐜𝐚𝐥𝐥 𝐢𝐭 𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐑𝐞𝐩𝐥𝐚𝐲, 𝐬𝐭𝐚𝐭𝐞𝐟𝐮𝐥 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐟𝐨𝐫 𝐂𝐨𝐓.
The reasoning state at any point is the cumulative set of insights the model has generated so far, compressed abstractions of prior reasoning. InsightReplay periodically extracts these insights and replays them near the active generation frontier, keeping them close to the decoding position so attention stays intact.
What happens when you do this: The baseline peaks around 15K tokens on LiveCodeBench and then degrades. InsightReplay operates precisely in that degradation regime. 1 replay round improves accuracy. 3 rounds exceed the baseline's peak. 5 rounds keep climbing.
The degradation regime becomes a continued-growth regime.
• Critical insights and the surrounding trace are complementary; you need both
• Attention to insights decays as CoT grows. This is the bottleneck
• Replaying insights near the frontier shifts the optimal reasoning length rightward and raises the peak
Works purely at inference time across 30B-tier models. No training required. Post-training on this pattern improves both stability and performance over vanilla CoT.
𝐓𝐞𝐬𝐭-𝐭𝐢𝐦𝐞 𝐬𝐜𝐚𝐥𝐢𝐧𝐠 𝐢𝐬𝐧'𝐭 𝐣𝐮𝐬𝐭 𝐚𝐛𝐨𝐮𝐭 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐥𝐨𝐧𝐠𝐞𝐫. 𝐈𝐭'𝐬 𝐚𝐛𝐨𝐮𝐭 𝐤𝐞𝐞𝐩𝐢𝐧𝐠 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐬𝐭𝐚𝐭𝐞 𝐚𝐜𝐜𝐞𝐬𝐬𝐢𝐛𝐥𝐞.

🧵1/N
The folk story about chain-of-thought: longer reasoning → better accuracy.
Recent studies show the curve isn't monotone. For a fixed problem, accuracy follows an inverted-U with CoT length: it rises, peaks, then declines. That peak puts a ceiling on what test-time scaling can buy you, and we'd like to push past it.

🧵2/N: The attention decay problem in long CoT
We identify one driver of this decline: attention from the answer position to the trace's *critical insights* decays monotonically with distance.
Qwen3-8B: -19.2% end-to-end attention decay. Gemma-4-31B-it: -3.3% (its hybrid local/global attention helps).
The longer the model reasons, the less accessible its own earlier conclusions become.
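If you want to probe the decay yourself, here's a minimal sketch using HuggingFace transformers. It's an illustrative reconstruction, not our measurement code: `output_attentions` returns post-softmax weights, so treat it as a proxy for the pre-softmax scores above, and the insight span indices are whatever you mark by hand.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-8B"  # one of the models named in this thread
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # needed so attention weights are returned
)

@torch.no_grad()
def attention_to_span(text: str, span: tuple[int, int]) -> float:
    """Mean attention from the last (answer-side) position to the token
    span [span[0], span[1]) holding a critical insight, averaged over
    all layers and heads."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, output_attentions=True)
    # out.attentions: one (batch, heads, q_len, k_len) tensor per layer
    per_layer = [a[0, :, -1, span[0]:span[1]].sum(-1).mean()
                 for a in out.attentions]
    return torch.stack(per_layer).mean().item()
```

Scoring the same insight sentence at increasing distances from the generation frontier traces out the decay curve.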

🧵3/N: An old problem in new clothes: the LSTM analogy
This is the same challenge sequence models have always faced: preserving information across long stretches of a sequence.
LSTMs solved it in 1997 (@SchmidhuberAI) with an explicit memory cell: a dedicated state that persists across time and gets continually updated as the sequence unfolds. The lesson wasn't "make recurrence stronger." It was "give the network a maintained memory."
Long CoT has the same shape of problem. The "sequence" is the reasoning trace. The "long-range information" is the critical insights produced earlier. Attention to them decays with distance.
So the natural question: can we equip a growing reasoning chain with an analogous mechanism, one that keeps critical insights accessible as the chain extends?
That's what InsightReplay does. ↓
🧵4/N: InsightReplay, a stateful reasoning approach
Reasoning unfolds as two interleaving streams:
• Reasoning chunks (R_t): step-by-step thinking
• Insights (I_t): short distilled conclusions the model writes about its own prior reasoning
At each round t, the model (i) generates a reasoning chunk, then (ii) summarizes the conclusions reached so far into a new insight. The new insight is appended right before the next round of thinking begins.
Two consequences fall out:
• The latest insight always sits adjacent to the active generation frontier.
• Attention to critical conclusions stays intact no matter how long the chain grows.
We're not making the model think more. We're keeping the important state close.
Each new insight is generated in the presence of all prior ones. So the insight trace isn't a flat list of summaries; it's an evolving abstraction. New insights can correct, refine, or supersede earlier ones.
The reasoning state at any point isn't the raw trace. It's the cumulative set of distilled conclusions, kept near the position where the model is actually decoding.
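A minimal sketch of the round structure described above, assuming a generic `generate(prompt)` completion function. The cue strings and state formatting are hypothetical stand-ins, not the paper's actual prompts.

```python
REASON_CUE = "\nContinue reasoning step by step.\n"          # hypothetical cue
INSIGHT_CUE = "\nDistill the key conclusions reached so far.\n"  # hypothetical cue

def insight_replay(problem: str, generate, rounds: int = 3) -> str:
    insights: list[str] = []  # cumulative reasoning state I_1..I_t
    context = problem
    for t in range(rounds):
        chunk = generate(context + REASON_CUE)             # reasoning chunk R_t
        insight = generate(context + chunk + INSIGHT_CUE)  # insight I_t, written
        insights.append(insight)                           # with prior I's in view
        # Replay the cumulative insights adjacent to the generation frontier,
        # so round t+1 attends to them at near-zero distance.
        context += chunk + "\nKey insights so far:\n" + "\n".join(insights)
    return generate(context + "\nFinal answer:\n")
```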

🧵5/N: A theoretical analysis of InsightReplay
The empirical inverted-U has a clean theoretical model. Wu et al. (2025) parameterize M-step CoT accuracy as a product over per-step success probabilities, but assume each step's accuracy is independent of its position in the chain.
Tweet 2/N showed why that's wrong: attention to critical insights decays with distance.
We restore that dependence by introducing an insight accessibility function θ(δ):
• θ(0) = 1 (adjacent → full access)
• θ strictly decreasing in δ
• θ(δ) → 0 as δ → ∞
In standard CoT, the model pays a multiplicative θ(t) penalty at every step, and the penalty worsens as the chain extends. In InsightReplay, insights are relocated to a fixed near-frontier distance δ_0, so the penalty becomes the constant θ(δ_0) > θ(t) for all t ≥ 1.
Two theorems drop out:
Theorem 1: InsightReplay shifts the optimal CoT length rightward: M_IR > M_θ.
Theorem 2: InsightReplay raises the achievable peak accuracy: A_IR(M_IR) > A_θ(M_θ).
These map exactly onto the two phenomena from Figure 1: the peak moves right, and it goes up.
The proof doesn't say InsightReplay is the only solution. It says any mechanism that keeps θ(δ) bounded above the natural decay floor reshapes the inverted-U in this direction.
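A toy numerical instantiation makes both theorems visible. The functional forms and constants below are my illustrative choices, not the paper's fitted values: per-step success q(M) = 1 - d/M (finer decomposition makes each step easier) and accessibility θ(δ) = exp(-λδ).

```python
import math

d, lam, delta0 = 0.5, 0.002, 1.0  # illustrative constants, not fitted values

def acc_standard(M: int) -> float:
    # Standard CoT: step t pays theta(t) = exp(-lam * t); insights drift away.
    return math.prod((1 - d / M) * math.exp(-lam * t) for t in range(1, M + 1))

def acc_replay(M: int) -> float:
    # InsightReplay: every step pays the fixed near-frontier theta(delta0).
    return math.prod((1 - d / M) * math.exp(-lam * delta0) for _ in range(M))

peak = lambda f: max(range(1, 60), key=f)
Ms, Mr = peak(acc_standard), peak(acc_replay)
print(f"standard CoT:  peak at M={Ms}, acc={acc_standard(Ms):.3f}")  # M=4, 0.575
print(f"InsightReplay: peak at M={Mr}, acc={acc_replay(Mr):.3f}")    # M=8, 0.587
```

The replay peak lands later and higher than the standard-CoT peak: exactly Theorems 1 and 2.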
🧵6/N: InsightReplay works, and longer thinking alone doesn't
24 settings (2 scales × 3 families × 4 benchmarks). 3-round InsightReplay: non-negative gains on every cell, +1.65 macro avg, largest +9.2 on R1-Distill-32B / LCB.
The control matters more than the headline.
"Verify-Only" gives the model the same extra tokens and a "wait, let me double-check" cue โ no insight extraction. It captures only +0.61 of the +1.65 gain. The remaining +1.04 (over 60%) comes from InsightReplay itself.
Longer thinking is not the active ingredient. Keeping critical state accessible is.
The inverted-U becomes continued growth: at 5 replay rounds on LCB, accuracy is still climbing where standard CoT has already collapsed.
🧵7/N: It can also be reinforced via RL post-training
We trained Qwen3-4B-Base with GRPO on DAPO-Math-15K. Baseline: standard CoT rollouts. InsightReplay: when the policy emits EOS, splice in a fixed continuation cue and let generation continue under the same length budget. Cue tokens are masked out of the loss; the only signal is the reasoning pattern itself.
Three findings on AIME 2025:
1. Stability. The baseline degrades after step ~300: mean@32 drops 27.9% → 21.6%, best@32 falls 57.5% → 42.0%. InsightReplay stays stable through step 600.
2. Peak. InsightReplay wins on all three metrics. Largest gap on maj@32 (the consistency-sensitive metric): 38.8% vs 34.1%, +4.7 points.
3. Not an initialization effect. The first ~200 steps overlap within noise. Divergence emerges only after step ~250: the benefit is produced during training, not inherited from setup.
InsightReplay isn't just an inference trick. It's a reasoning pattern that can be reinforced as easily as it can be prompted.
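Rollout-side, the EOS splice described above is a few lines. A minimal sketch assuming a generic `policy.sample_next` interface; the cue text, budget handling, and replay cap are hypothetical stand-ins for the actual GRPO setup.

```python
CUE = "\nWait: let me distill the key insights so far and continue.\n"  # hypothetical

def rollout_with_replay(policy, prompt_ids, cue_ids, eos_id, budget, max_replays=3):
    ids, loss_mask = list(prompt_ids), [0] * len(prompt_ids)  # prompt never trained on
    while len(ids) < budget:
        next_id = policy.sample_next(ids)  # hypothetical sampling interface
        if next_id == eos_id and max_replays > 0 and len(ids) + len(cue_ids) < budget:
            ids += cue_ids                   # splice in the continuation cue
            loss_mask += [0] * len(cue_ids)  # cue tokens masked out of the loss
            max_replays -= 1
            continue                         # generation resumes after the cue
        ids.append(next_id)
        loss_mask.append(1)                  # policy tokens carry the GRPO signal
        if next_id == eos_id:
            break
    return ids, loss_mask
```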

🧵8/8
Test-time scaling has been framed as "let the model think longer." Our results suggest a sharper frame: it's about whether critical intermediate state stays accessible across long reasoning trajectories. Reasoning depth without state preservation hits a ceiling. Stateful reasoning moves it.
Project website: https://research.simular.ai/insight-replay/
Paper: https://arxiv.org/abs/2605.14457
Code: https://github.com/simular-ai/insight-replay
Kudos to the InsightReplay team: Bin, Caiwen, Jiachen, @angli_ai, @xwang_lk. A collaboration between @SimularAI and U Minnesota.