Prime Intellect's @kalomaze highlights GLM 5.2's oscillating chain-of-thought trace, prompting calls for RL-tuned non-verbose models
Story Overview
GLM 5.2's visible thinking mode can stall on tiny choices. In one trace, the model wavered between tightening a threshold to 0.3 or 0.4, decided its report mattered more, and then went back to re-check anyway. The trace, shared by a Prime Intellect researcher, captures the kind of repetitive internal monologue that surfaces during long-horizon tasks even though the model supports explicit reasoning-effort controls.
RL tuning could trim the monologues
A fellow engineer suggested training a less chatty variant through targeted reinforcement learning. The current agentic post-training already emphasizes trajectory compaction, yet it still leaves room for this kind of oscillation. No official non-verbose checkpoint has appeared so far.
Users already spot the pattern in practice
Early reports note similar extended deliberation on straightforward prompts, raising the practical question of when to switch from Max thinking mode to direct answers for everyday work.
Most Activity
someone should really RL-tune a non-verbose GLM…
> "should I tighten the threshold to 0.3? Let me check. Actually 0.4. Actually let me move on. Actually wait let me also tighten. Actually the report is more important. Actually let me re-check" - GLM 5.2, CoT
@Dorialexander GLM 5.2q for quiet
@Dorialexander Someone will. But brevity RL trades visible indecision for invisible shortcuts. Sometimes the model just learns to be confidently wrong faster.