Tech · 4h ago

CoreAutoAI's Rohan Anil asks if ML optimization is shifting to GRPO, while Rishabh Agarwal proposes simpler SignSGD for RL workloads

Story Overview

CoreAutoAI co-founder Rohan Anil is asking whether the optimizer discussion around Muon and Shampoo is ready to move on to GRPO and reinforcement learning more broadly. Rishabh Agarwal floats a simpler alternative: SignSGD, which drops momentum and second-order statistics in exchange for speed and lower memory use, on the grounds that RL gradient updates are noisy anyway.

Original post

rohan anil@_arohan_ · #86 in Tech

Good morning!

Looks like timeline is still talking about training algorithms for neural networks, particularly muon and shampoo.

Are we mentally prepared to talk about GRPO and RL?

8:46 AM · Jun 11, 2026 · 16.5K Views

Open Question

GRPO readiness remains an open debate

GRPO estimates advantages relative to a group of sampled responses, which removes the need for a separate critic model. The thread confirms neither a timeline nor a scale for any community pivot toward it, so the shift itself remains speculative.
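As a rough illustration of group-based advantage estimation, the sketch below normalizes each response's reward against the other responses sampled for the same prompt. It follows the commonly described GRPO formulation (subtract the group mean, divide by the group standard deviation); the function name, array layout, and `eps` value are assumptions made for this example and are not taken from the thread.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each group of sampled responses.

    rewards: array-like of shape (num_prompts, group_size), one row per prompt.
    Returns an array of the same shape holding per-response advantages.
    No critic/value model is involved: the group itself is the baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)  # per-prompt baseline
    std = rewards.std(axis=1, keepdims=True)    # per-prompt scale
    return (rewards - mean) / (std + eps)       # eps guards against zero spread

# Hypothetical rewards: 2 prompts, 4 sampled responses each.
rewards = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, 1.0]]
adv = group_relative_advantages(rewards)
```

A group whose responses all receive the same reward gets zero advantage everywhere, so it contributes no learning signal under this scheme.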

Developer Impact

SignSGD trades momentum for efficiency

The proposal points to SignSGD's speed and low memory use, and to the noisiness of RL gradient updates, but cites no RL benchmarks, so its practical edge remains an informed suggestion rather than a proven result.
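For readers unfamiliar with the method, a minimal sketch of a momentum-free SignSGD step is shown below. It uses only the sign of each gradient entry and keeps no optimizer state. The dictionary-of-arrays parameter layout, the function name, and the learning rate are assumptions chosen for illustration, not details from the thread.

```python
import numpy as np

def signsgd_step(params, grads, lr=1e-3):
    """Apply one in-place SignSGD update with no momentum.

    params, grads: dicts mapping parameter names to NumPy arrays of equal shape.
    Each entry moves by exactly lr against the sign of its gradient;
    entries with zero gradient are left unchanged.
    """
    for name, g in grads.items():
        params[name] -= lr * np.sign(g)
    return params

# Hypothetical parameters and gradients.
params = {"w": np.array([0.5, -0.2, 0.0]), "b": np.array([0.1])}
grads = {"w": np.array([3.0, -0.01, 0.0]), "b": np.array([-7.0])}
signsgd_step(params, grads, lr=0.1)
```

Because the update depends only on the current gradient, there are no per-parameter buffers to store, whereas SGD with momentum keeps one and Adam keeps two. That is the memory saving the proposal refers to.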

Sentiment

Many users expressed excitement for more RL training content on methods like GRPO, following the Muon and Shampoo buzz and the SignSGD suggestion. They praised the arguments shared so far and urged releases.

Positive: 100.0% · Negative: 0.0%

7 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 7.1K · Bookmarks 29 · Likes 60 · Retweets 2 · Replies 3

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

4h7.1K6029

rohan anil@_arohan_

@ziv_ravid Variance reduction argument is main one I buy. From feature learning perspective, may not so much

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ What do you think about on policy self-distillation? 🧐

4h1K134

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

4h1.7K140

Sagnik@saagnikkk

@agarwl_ Actually we found that vanilla SGD ( and later signSGD w/o momentum) is good enough for RL.

Evidence for SGD:

4h6911

rohan anil@_arohan_

@jachiam0 We must teach these models that. They seem to have catastrophic forgetting

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

4h1K80

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ What do you think about on policy self-distillation? 🧐

4h1.3K20

samsja@samsja19

@agarwl_ afaik glm5 was trained like this

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

3h33760

rohan anil@_arohan_

@agarwl_ AgarwalModdedRL when!?

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

4h60640

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ Do you think the on policy is an important part?

rohan anil@_arohan_

@ziv_ravid Variance reduction argument is main one I buy. From feature learning perspective, may not so much

4h37100

stochasm@stochasticchasm

@_arohan_ @HessianFree 🍿

4h1474

Yann Viegas@_Yann77

@_arohan_ Hot take people are not ready for: random perturbations is competitive with RL for a fixed task and I think much more efficient search algorithms should exist

19m201

elie@eliebakouch

@jachiam0 @_arohan_ BREAKING NEWS: OPENAI EMPLOYEE LEAK ALPHA ON X ON HOW THEY DO RL WITHOUT VERIFIABLE REWARDS

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

3h19730

rohan anil@_arohan_

@ziv_ravid Isn’t it a question of what conditioning you provide, which is the new information that is added?

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ Do you think the on policy is an important part?

4h31820

Pierre Bongrand@bongrandp

@_arohan_ The discussion, arguments & runs were super interesting! You should share more what you do

3h671

xeon@saymycodename

@_arohan_ would like to see more discussions about it

4h201

Kratius@Kratius1

@_arohan_ Yes sir .... Drop it..

4h160

Aakash Kumar Nain@A_K_Nain

@_arohan_ Hahaha I have things to say that I haven't said publicly for a reason 😂😂

4h140

Ahmed Ahmed@AhmedSQRD

@_arohan_ please keep the hot takes coming 🙏🏾

4h124

Mohammed Alshehri@SwishMoe

@_arohan_ Born ready Release the environment!!!!

4h107

Edward Milsom@edward_milsom

@_arohan_ It feels like you want to be more conservative with the preconditioner in these noisy settings. Related: we recently wrote up some ideas on how Muon might be adapted to noisy settings:

3h63