Tech · 4h ago

CoreAutoAI's Rohan Anil asks if ML optimization is shifting to GRPO, while Rishabh Agarwal proposes simpler SignSGD for RL workloads

Story Overview

CoreAutoAI co-founder Rohan Anil is asking whether a timeline still preoccupied with the Muon and Shampoo optimizers is ready to talk about GRPO and reinforcement learning more broadly. Rishabh Agarwal responds with a simpler idea: SignSGD, which drops momentum and second-order terms for speed and lower memory use, may be good enough for RL because the gradient updates are noisy anyway.

Original post
rohan anil @_arohan_ · #86 in Tech

Good morning!

Looks like timeline is still talking about training algorithms for neural networks, particularly muon and shampoo.

Are we mentally prepared to talk about GRPO and RL?

8:46 AM · Jun 11, 2026 · 16.5K Views
Open Question

GRPO readiness remains an open debate

Participants highlight GRPO's group-based advantage estimation that skips a separate critic model, yet no timelines or scale of any community pivot are confirmed, leaving the actual shift speculative.
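As a rough illustration of group-based advantage estimation, the sketch below scores each sampled response against the other responses drawn for the same prompt, so the group's own mean reward stands in for a learned critic. The function name `group_relative_advantages`, the `eps` term, the choice of population standard deviation, and the example rewards are all assumptions made for illustration, not details from the thread.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. No value (critic) network is involved: the group mean is the
    baseline. `eps` is an assumed guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to the same prompt
# (for example, 1.0 if a verifier accepted the answer, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.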

Developer Impact

SignSGD trades momentum for efficiency

The proposal notes SignSGD's compression benefits in noisy environments without specifying current RL benchmarks, so its practical edge stays at the level of informed suggestion rather than proven result.
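To make the trade-off concrete, here is a minimal sketch of a momentum-free SignSGD step. Each parameter moves by a fixed step in the direction opposite its gradient's sign, so the update ignores gradient magnitude and keeps no optimizer state between steps. The helper names `sign` and `signsgd_step`, the learning rate, and the toy values are assumptions for illustration only.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grads, lr=0.01):
    """One SignSGD update without momentum.

    Only the sign of each gradient is used, so a very large (possibly
    noisy) gradient moves its parameter no further than a small one.
    Nothing is carried over between calls: there are no momentum or
    second-moment buffers to store.
    """
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# Hypothetical parameters and gradients; note the outlier gradient -12.0.
params = [0.5, -1.0, 2.0]
grads = [0.3, -12.0, 0.0]
new_params = signsgd_step(params, grads, lr=0.1)
print(new_params)
```

By comparison, Adam stores two extra buffers per parameter (first and second moment estimates), which is where the memory saving of a stateless update comes from. Whether that saving costs accuracy on real RL workloads is exactly what the thread leaves open.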

Sentiment

Many users welcomed more RL training content on methods like GRPO, following the Muon and Shampoo buzz and the SignSGD suggestion. Replies praised the arguments and urged further releases.

Positive: 100.0% · Negative: 0.0% (7 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views 7.1K · Bookmarks 29 · Likes 60 · Retweets 2 · Replies 3

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

4h · Views 7.1K · Likes 60 · Bookmarks 29
rohan anil @_arohan_

@ziv_ravid Variance reduction argument is main one I buy. From feature learning perspective, may not so much

@_arohan_ What do you think about on policy self-distillation? 🧐

4h · Views 1K · Likes 13 · Bookmarks 4
Joshua Achiam @jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

4h · Views 1.7K · Likes 14 · Bookmarks 0
Sagnik @saagnikkk

@agarwl_ Actually we found that vanilla SGD ( and later signSGD w/o momentum) is good enough for RL.

Evidence for SGD:

4h · Views 69 · Likes 1 · Bookmarks 1
rohan anil @_arohan_

@jachiam0 We must teach these models that. They seem to have catastrophic forgetting

4h · Views 1K · Likes 8 · Bookmarks 0

@_arohan_ What do you think about on policy self-distillation? 🧐

4h · Views 1.3K · Likes 2 · Bookmarks 0
samsja @samsja19

@agarwl_ afaik glm5 was trainer like this

3h · Views 337 · Likes 6 · Bookmarks 0
rohan anil @_arohan_

@agarwl_ AgarwalModdedRL when!?

4h · Views 606 · Likes 4 · Bookmarks 0

@_arohan_ Do you think the on policy is an important part?

4h · Views 371 · Likes 0 · Bookmarks 0
stochasm @stochasticchasm

@_arohan_ @HessianFree 🍿

4h · Views 147 · Likes 4
Yann Viegas @_Yann77

@_arohan_ Hot take people are not ready for: random perturbations is competitive with RL for a fixed task and I think much more efficient search algorithms should exist

19m · Views 20 · Likes 1
elie @eliebakouch

@jachiam0 @_arohan_ BREAKING NEWS: OPENAI EMPLOYEE LEAK ALPHA ON X ON HOW THEY DO RL WHITOUT VERIFIABLE REWARDS

3h · Views 197 · Likes 3 · Bookmarks 0
rohan anil @_arohan_

@ziv_ravid Isn’t it a question of what conditioning you provide, which is the new information that is added?

4h · Views 318 · Likes 2 · Bookmarks 0
Pierre Bongrand @bongrandp

@_arohan_ The discussion, arguments & runs were super interesting! You should share more what you do

3h · Views 67 · Likes 1
xeon @saymycodename

@_arohan_ would like to see more discussions about it

4h · Views 201
Kratius @Kratius1

@_arohan_ Yes sir .... Drop it..

4h · Views 160

@_arohan_ Hahaha I have things to say that I haven't said publicly for a reason 😂😂

4h · Views 140
Ahmed Ahmed @AhmedSQRD

@_arohan_ please keep the hot takes coming 🙏🏾

4h · Views 124

@_arohan_ Born ready Release the environment!!!!

4h · Views 107
Edward Milsom @edward_milsom

@_arohan_ It feels like you want to be more conservative with the preconditioner in these noisy settings. Related: we recently wrote up some ideas on how Muon might be adapted to noisy settings:

3h · Views 63