Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)#450
𝚐𝔪𝟾𝚡𝚡𝟾@gm8xx8
ThoughtFold has a clean RLVR angle: correct long CoTs contain both useful reasoning and redundant exploration, but outcome rewards reinforce all of it. Instead of just rewarding shorter answers, it prunes correct chains, verifies what can be removed, then uses masked preference learning to penalize redundant steps and keep the reasoning path tighter.
Paper: https://arxiv.org/abs/2606.03503
12:18 PM · Jun 5, 2026 · 4.3K Views
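
A rough sketch of the pipeline as the post describes it (prune a correct chain, verify what can be removed, then apply a masked preference loss), not the paper's actual implementation. Everything here is an assumption for illustration: the greedy one-step-at-a-time pruning, the hypothetical `still_correct` verifier callback, and the DPO-style loss form with per-token masks. The paper may define all of these differently.

```python
import math
from typing import Callable, List, Sequence, Tuple


def prune_chain(
    steps: Sequence[str],
    still_correct: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Greedily drop steps, keeping a drop only if the verifier still accepts.

    `still_correct` is a hypothetical verifier (e.g. re-derive the answer
    from the shortened chain and check it against the reference).
    """
    kept, removed = list(steps), []
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_correct(candidate):
            removed.append(kept[i])  # verified redundant
            kept = candidate
        else:
            i += 1  # step is needed, move on
    return kept, removed


def masked_logprob(logps: Sequence[float], mask: Sequence[int]) -> float:
    """Sum token log-probs only where mask == 1."""
    return sum(lp for lp, m in zip(logps, mask) if m)


def masked_preference_loss(
    policy_chosen: Sequence[float],
    ref_chosen: Sequence[float],
    chosen_mask: Sequence[int],
    policy_rejected: Sequence[float],
    ref_rejected: Sequence[float],
    rejected_mask: Sequence[int],
    beta: float = 0.1,
) -> float:
    """DPO-style loss where only masked tokens carry the preference signal.

    Assumption: the rejected mask marks tokens of the redundant steps, so
    shared useful tokens in the original chain are not penalized.
    """
    chosen = masked_logprob(policy_chosen, chosen_mask) - masked_logprob(ref_chosen, chosen_mask)
    rejected = masked_logprob(policy_rejected, rejected_mask) - masked_logprob(ref_rejected, rejected_mask)
    z = beta * (chosen - rejected)
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy usage: one dead-end step in an otherwise correct chain.
steps = ["a = 2", "try b = 5 (dead end)", "b = 3", "a + b = 5"]
needed = {"a = 2", "b = 3", "a + b = 5"}
kept, removed = prune_chain(steps, lambda c: needed.issubset(c))

# Per-token log-probs are made up; the mask selects the redundant tokens.
loss = masked_preference_loss(
    policy_chosen=[-1.0, -1.0], ref_chosen=[-1.0, -1.0], chosen_mask=[1, 1],
    policy_rejected=[-1.0, -3.0], ref_rejected=[-1.0, -1.0], rejected_mask=[0, 1],
)
```

In this toy setup the policy already assigns lower probability than the reference to the masked redundant token, so the loss lands below the `log(2)` value it would take if policy and reference agreed.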