The RAGEN paper shows agents reason less after RL, which is counterintuitive to me. I'm wondering if this is an artifact of synthetic environments / vanilla GRPO rather than a legit pattern, but biases in GRPO tend to inflate response lengths rather than shorten them.
1:15 PM · Jun 10, 2026 · 916 Views
Cameron R. Wolfe, Ph.D. @cwolferesearch
link to paper: https://arxiv.org/abs/2504.20073

Alex YGift @Radipdegen
@cwolferesearch might be overfitting to synthetic reward patterns rather than actual reasoning
2h