
InsightReplay Overcomes Accuracy Decay in Long Chain-of-Thought Reasoning

Original post

Longer reasoning ≠ better reasoning. This breaks the scaling law. Can we fix it?

On a given problem, CoT accuracy follows an inverted-U: it rises, peaks, then falls as the chain grows longer. Harder problems push the peak rightward, but the cliff is always there. Test-time scaling has a ceiling that few talk about.

So we asked: why does extra thinking hurt?

We measured pre-softmax attention from answer tokens back to the critical insights buried earlier in the chain: the small subset of sentences that actually determine the final answer. The decay is monotonic with distance. The longer the model reasons, the less access it has to the very conclusions that matter most. It's reasoning with a fading memory of its own best ideas.

This is the same problem sequence models have always faced. LSTMs solved it with an explicit memory cell that persists and updates as the sequence unfolds. The fix for long CoT should look the same.

That's what we built. We call it InsightReplay: stateful reasoning for CoT.

The reasoning state at any point is the cumulative set of insights the model has generated so far: compressed abstractions of prior reasoning. InsightReplay periodically extracts these insights and replays them near the active generation frontier, keeping them close to the decoding position so attention stays intact.

What happens when you do this: the baseline peaks around 15K tokens on LiveCodeBench and then degrades. InsightReplay operates precisely in that degradation regime. 1 replay round improves accuracy. 3 rounds exceed the baseline's peak. 5 rounds keep climbing.

The degradation regime becomes a continued-growth regime.
→ Critical insights and the surrounding trace are complementary: you need both
→ Attention to insights decays as CoT grows. This is the bottleneck
→ Replaying insights near the frontier shifts the optimal reasoning length rightward and raises the peak

Works at pure inference time across 30B-tier models. No training required. Post-training on this pattern improves both stability and performance over vanilla CoT.

Test-time scaling isn't just about reasoning longer. It's about keeping the right state accessible.

9:09 AM · May 15, 2026

🧵1/N

The folk story about chain-of-thought: longer reasoning → better accuracy.

Recent studies show the curve isn't monotone. For a fixed problem, accuracy follows an inverted-U with CoT length: it rises, peaks, then declines. That peak puts a ceiling on what test-time scaling can buy you, and we'd like to push past it.

Xin Eric Wang (hiring postdoc) @xwang_lk

4:28 PM · May 15, 2026 · 461 Views

🧵2/N: The attention decay problem in long CoT

Why? We identify one driver of this decline: attention from the answer position to the trace's *critical insights* decays monotonically with distance.

Qwen3-8B: -19.2% end-to-end attention decay. Gemma-4-31B-it: -3.3% (its hybrid local/global attention helps).

The longer the model reasons, the less accessible its own earlier conclusions become.
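To make the "end-to-end decay" figure concrete, here is a minimal sketch of one way such a number could be computed from attention scores bucketed by distance. The definition used here (relative change between the nearest and the farthest bucket), the function names, and the toy values are all assumptions for illustration; the thread does not spell out the exact formula.

```python
from typing import Sequence


def end_to_end_decay(scores_by_distance: Sequence[float]) -> float:
    """Relative change from the nearest to the farthest distance bucket.

    Assumed definition: (last - first) / first. Negative means decay.
    """
    first, last = scores_by_distance[0], scores_by_distance[-1]
    if first == 0:
        raise ValueError("nearest-bucket score must be non-zero")
    return (last - first) / first


def is_monotone_decreasing(xs: Sequence[float]) -> bool:
    """True if every score is strictly lower than the one before it."""
    return all(a > b for a, b in zip(xs, xs[1:]))


# Made-up scores, ordered from nearest to farthest insight distance.
toy_scores = [2.0, 1.8, 1.7, 1.5]
print(is_monotone_decreasing(toy_scores))
print(f"{end_to_end_decay(toy_scores):.1%}")
```

Under this reading, a value like -19.2% would mean the farthest insights receive about a fifth less attention than the nearest ones.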

4:30 PM · May 15, 2026 · 2 Views

🧵3/N: An old problem in new clothes: Analogy to LSTM

This is the same challenge sequence models have always faced: preserving information across long stretches of a sequence.

LSTMs solved it in 1997 (@SchmidhuberAI) with an explicit memory cell: a dedicated state that persists across time and gets continually updated as the sequence unfolds. The lesson wasn't "make recurrence stronger." It was "give the network a maintained memory."

Long CoT has the same shape of problem. The "sequence" is the reasoning trace. The "long-range information" is the critical insights produced earlier. Attention to them decays with distance.

So the natural question: can we equip a growing reasoning chain with an analogous mechanism, one that keeps critical insights accessible as the chain extends?

That's what InsightReplay does. ↓

5:00 PM · May 15, 2026 · 223 Views

🧵4/N: InsightReplay, a stateful reasoning approach

Reasoning unfolds as two interleaving streams:
→ Reasoning chunks (R_t): step-by-step thinking
→ Insights (I_t): short distilled conclusions the model writes about its own prior reasoning

At each round t, the model (i) generates a reasoning chunk, then (ii) summarizes the conclusions reached so far into a new insight. The new insight is appended right before the next round of thinking begins.

Two consequences fall out.

First, the latest insight always sits adjacent to the active generation frontier, so attention to critical conclusions stays intact no matter how long the chain grows. We're not making the model think more. We're keeping the important state close.

Second, each new insight is generated in the presence of all prior ones, so the insight trace isn't a flat list of summaries; it's an evolving abstraction. New insights can correct, refine, or supersede earlier ones.

The reasoning state at any point isn't the raw trace. It's the cumulative set of distilled conclusions, kept near the position where the model is actually decoding.
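The round structure described above can be sketched as a short loop. This is a rough illustration under stated assumptions, not the released implementation: `generate` is a hypothetical prompt-to-completion callable, and the `[REASON]` / `[SUMMARIZE]` cues and `<R_t>` / `<I_t>` tags are invented here only to make the interleaving visible.

```python
from typing import Callable


def insight_replay(problem: str, generate: Callable[[str], str], rounds: int = 3) -> str:
    """Interleave reasoning chunks R_t with insights I_t for `rounds` rounds.

    `generate` is a hypothetical prompt -> completion function.
    """
    context = problem
    for t in range(1, rounds + 1):
        # (i) Generate the next reasoning chunk from everything so far.
        chunk = generate(context + "\n[REASON]")
        context += f"\n<R_{t}>{chunk}</R_{t}>"
        # (ii) Summarize the conclusions reached so far. All prior insights
        # are still in `context`, so the new one can refine or supersede them.
        insight = generate(context + "\n[SUMMARIZE]")
        # Appending here places the insight right before the next chunk,
        # i.e. adjacent to the generation frontier.
        context += f"\n<I_{t}>{insight}</I_{t}>"
    return context


def fake_generate(prompt: str) -> str:
    """Stand-in model so the sketch runs without an LLM."""
    return "insight" if prompt.endswith("[SUMMARIZE]") else "chunk"


print(insight_replay("Q", fake_generate, rounds=2))
```

Swapping `fake_generate` for a real model call is the only change needed to try the pattern; how the cues are worded is a design choice this sketch does not settle.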

5:05 PM · May 15, 2026 · 218 Views

🧵5/N: A Theoretical Analysis of InsightReplay

The empirical inverted-U has a clean theoretical model. Wu et al. (2025) parameterize M-step CoT accuracy as a product over per-step success probabilities, but assume each step's accuracy is independent of its position in the chain.

Section 2/N says that's wrong: attention to critical insights decays with distance.

We restore that dependence by introducing an insight accessibility function θ(δ):
→ θ(0) = 1 (adjacent → full access)
→ θ strictly decreasing in δ
→ θ(δ) → 0 as δ → ∞

In standard CoT, the model pays a multiplicative θ(t) penalty at every step, and the penalty worsens as the chain extends. In InsightReplay, insights are relocated to a fixed near-frontier distance δ_0, so the penalty becomes the constant θ(δ_0) > θ(t) for all t ≥ 1.

Two theorems drop out:
Theorem 1: InsightReplay shifts the optimal CoT length rightward: M_IR > M_θ
Theorem 2: InsightReplay raises the achievable peak accuracy: A_IR(M_IR) > A_θ(M_θ)

These map exactly onto the two phenomena from Figure 1: the peak moves right, and it goes up.

The proof doesn't say InsightReplay is the only solution. It says any mechanism that keeps θ(δ) bounded above the natural decay floor reshapes the inverted-U in this direction.
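A toy numerical check can make the two theorems tangible. Everything in this sketch is an assumption chosen for illustration and is not the paper's model: the exponential form of θ, the saturating "coverage" term that rewards more steps, and the parameters `s`, `k`, `lam`, and `delta0`. It only shows that swapping a growing penalty θ(t) for a constant θ(δ_0) moves the peak right and up in one simple setting.

```python
import math


def theta(delta: float, lam: float = 0.001) -> float:
    """Toy accessibility: theta(0) = 1, strictly decreasing, -> 0 as delta grows."""
    return math.exp(-lam * delta)


def log_accuracy(M: int, replay: bool, s: float = 0.995,
                 k: float = 20.0, delta0: float = 0.5) -> float:
    """Log of a toy M-step accuracy: coverage gain times per-step success."""
    # Assumed benefit of more steps: saturating coverage 1 - exp(-M / k).
    total = math.log(1.0 - math.exp(-M / k))
    for t in range(1, M + 1):
        # Standard CoT: insight distance grows with t. Replay: fixed at delta0.
        distance = delta0 if replay else t
        total += math.log(s * theta(distance))
    return total


def peak(replay: bool, max_steps: int = 300):
    """Return (optimal length, peak accuracy) over 1..max_steps."""
    best_m = max(range(1, max_steps + 1), key=lambda m: log_accuracy(m, replay))
    return best_m, math.exp(log_accuracy(best_m, replay))


M_theta, A_theta = peak(replay=False)
M_ir, A_ir = peak(replay=True)
print(f"standard CoT: peak at M={M_theta}, accuracy={A_theta:.3f}")
print(f"with replay:  peak at M={M_ir}, accuracy={A_ir:.3f}")
```

Here `delta0 = 0.5` is picked so that θ(δ_0) > θ(t) holds for every t ≥ 1, matching the condition stated in the thread.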

5:33 PM · May 15, 2026 · 150 Views

🧵6/N: InsightReplay works, and longer thinking alone doesn't

24 settings (2 scales × 3 families × 4 benchmarks). 3-round InsightReplay: non-negative gains on every cell, +1.65 macro avg, largest +9.2 on R1-Distill-32B / LCB.

The control matters more than the headline.

"Verify-Only" gives the model the same extra tokens and a "wait, let me double-check" cue, with no insight extraction. It captures only +0.61 of the +1.65 gain. The remaining +1.04 (over 60%) comes from InsightReplay itself.

Longer thinking is not the active ingredient. Keeping critical state accessible is.
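The decomposition of the gain is plain arithmetic, reproduced here from the numbers quoted in the thread (variable names are just labels for this sketch):

```python
settings = 2 * 3 * 4                  # scales x families x benchmarks
total_gain = 1.65                     # 3-round InsightReplay, macro avg
verify_only_gain = 0.61               # same extra tokens, no insight extraction
residual = total_gain - verify_only_gain
share = residual / total_gain         # fraction attributable to insight replay

print(settings, round(residual, 2), f"{share:.1%}")  # 24 1.04 63.0%
```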

The inverted-U becomes continued growth: at 5 replay rounds on LCB, accuracy is still climbing where standard CoT has already collapsed.

5:37 PM · May 15, 2026 · 147 Views

🧵7/N: It can also be reinforced via RL post-training

We trained Qwen3-4B-Base with GRPO on DAPO-Math-15K. Baseline: standard CoT rollouts. InsightReplay: when the policy emits EOS, splice in a fixed continuation cue and let generation continue under the same length budget. Cue tokens are masked out of the loss, so the only signal is the reasoning pattern itself.
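The splice-and-mask step can be sketched on plain token-id lists. This is a sketch under assumptions, not the training code: the `EOS` id, the `CUE` token ids, the choice to replace the EOS token with the cue, and the `budget` check are all hypothetical details picked to make the idea runnable.

```python
from typing import List, Tuple

EOS = 0                  # hypothetical end-of-sequence token id
CUE = [101, 102, 103]    # hypothetical ids for the fixed continuation cue


def splice_cue(tokens: List[int], budget: int,
               cue: List[int] = CUE, eos: int = EOS) -> Tuple[List[int], List[int], int]:
    """Swap a trailing EOS for the cue if there is room left in the budget.

    Returns (tokens, loss_mask, remaining_budget). The mask is 1 for tokens
    the policy generated and 0 for cue tokens, which are excluded from the loss.
    """
    mask = [1] * len(tokens)
    ends_with_eos = bool(tokens) and tokens[-1] == eos
    if ends_with_eos and len(tokens) - 1 + len(cue) < budget:
        tokens = tokens[:-1] + cue
        mask = mask[:-1] + [0] * len(cue)
    return tokens, mask, budget - len(tokens)


tokens, mask, remaining = splice_cue([5, 6, 7, EOS], budget=10)
print(tokens, mask, remaining)
```

Tokens generated after the cue would get mask value 1 again, so the policy is rewarded for what it does with the continuation rather than for the cue text itself.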

Three findings on AIME 2025:

1. Stability. The baseline degrades after step ~300: mean@32 drops 27.9% → 21.6%, best@32 falls 57.5% → 42.0%. InsightReplay stays stable through step 600.

2. Peak. InsightReplay wins on all three metrics. Largest gap on maj@32 (the consistency-sensitive metric): 38.8% vs 34.1%, +4.7 points.

3. Not an initialization effect. The first ~200 steps overlap within noise. Divergence emerges only after step ~250: the benefit is produced during training, not inherited from setup.
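For readers unfamiliar with the three metrics, here is a small sketch of how mean@k, best@k, and maj@k are commonly computed for one problem from k sampled answers. These definitions are assumed from common usage (average correctness, any-sample-correct, and majority-vote correctness); the thread does not define them, and the sample answers are made up.

```python
from collections import Counter
from typing import Sequence


def mean_at_k(answers: Sequence[str], gold: str) -> float:
    """Fraction of the k samples that are correct."""
    return sum(a == gold for a in answers) / len(answers)


def best_at_k(answers: Sequence[str], gold: str) -> float:
    """1.0 if at least one of the k samples is correct."""
    return float(any(a == gold for a in answers))


def maj_at_k(answers: Sequence[str], gold: str) -> float:
    """1.0 if the most frequent answer among the k samples is correct."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return float(top_answer == gold)


consistent = ["7", "7", "3", "7", "5"]   # mostly agrees on the right answer
scattered = ["3", "3", "7"]              # right answer appears, but loses the vote
for samples in (consistent, scattered):
    print(mean_at_k(samples, "7"), best_at_k(samples, "7"), maj_at_k(samples, "7"))
```

The second case shows why maj@k is the consistency-sensitive one: a correct sample exists, so best@k is 1.0, but the vote still fails.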

InsightReplay isn't just an inference trick. It's a reasoning pattern that can be reinforced as easily as it can be prompted.

5:40 PM · May 15, 2026 · 346 Views

🧵8/8

Test-time scaling has been framed as "let the model think longer." Our results suggest a sharper frame: it's about whether critical intermediate state stays accessible across long reasoning trajectories. Reasoning depth without state preservation hits a ceiling. Stateful reasoning moves it.

Project website: https://research.simular.ai/insight-replay/
Paper: https://arxiv.org/abs/2605.14457
Code: https://github.com/simular-ai/insight-replay

Kudos to the InsightReplay team: Bin, Caiwen, Jiachen, @angli_ai, @xwang_lk. A joint collaboration between @SimularAI and U Minnesota.

5:45 PM · May 15, 2026 · 416 Views