Prime Intellect's @kalomaze highlights GLM 5.2's oscillating chain-of-thought trace, prompting calls for RL-tuned non-verbose models

Story Overview

GLM 5.2's visible thinking mode can stall on tiny choices: in one trace, the model waffled between tightening a decision threshold to 0.3 or to 0.4, and between re-checking that choice and moving on to finish its report. The trace, shared by a Prime Intellect researcher, captures the kind of repetitive internal monologue that surfaces during long-horizon tasks, even though the model supports explicit reasoning-effort controls.

Original post

kalomaze@kalomaze

> "should I tighten the threshold to 0.3? Let me check. Actually 0.4. Actually let me move on. Actually wait let me also tighten. Actually the report is more important. Actually let me re-check" - GLM 5.2, CoT

1:40 AM · Jun 24, 2026 · 1.8K Views

Open Question

RL tuning could trim the monologues

A fellow engineer suggested training a less verbose variant through targeted reinforcement learning. The current agentic post-training already emphasizes trajectory compaction, yet it still leaves room for this kind of oscillation. No official non-verbose checkpoint has appeared so far.

Developer Impact

Users already spot the pattern in practice

Early reports describe similarly extended deliberation on straightforward prompts, which raises a practical question: when should users switch from Max thinking mode to direct answers for everyday work?

Posts from X

Alexander Doria@Dorialexander

someone should really RL-tune a non-verbose GLM…

rohit@krishnanrohit

@Dorialexander GLM 5.2q for quiet

Healthy Anon@arimedai

@Dorialexander Someone will. But brevity RL trades visible indecision for invisible shortcuts. Sometimes the model just learns to be confidently wrong faster.
