Tech · 1h ago

CoreAutoAI's Rohan Anil asks if ML optimization is shifting to GRPO, while Rishabh Agarwal proposes simpler SignSGD for RL workloads

Story Overview

CoreAutoAI co-founder Rohan Anil asked whether a timeline still busy debating the Muon and Shampoo optimizers is ready to move on to GRPO and reinforcement learning more broadly. Rishabh Agarwal responded by floating a simpler alternative: SignSGD with no momentum and no second-order information, which would be fast and light on memory in RL settings where gradient updates are noisy anyway.

Original post

rohan anil@_arohan_ · #86 in Tech

Good morning!

Looks like timeline is still talking about training algorithms for neural networks, particularly muon and shampoo.

Are we mentally prepared to talk about GRPO and RL?

8:46 AM · Jun 11, 2026 · 9K Views

Open Question

GRPO readiness remains an open debate

Participants point to GRPO's group-based advantage estimation, which avoids training a separate critic model. No timeline or scale for any community pivot has been confirmed, so the shift itself remains speculative.
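To make the group-based advantage idea concrete, here is a minimal sketch, not taken from the thread. It assumes the commonly described GRPO normalization, in which each sampled completion's reward is centered and scaled by the statistics of its own group, so no learned critic is needed. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other samples in its group.

    Assumption: advantage_i = (r_i - mean(group)) / (std(group) + eps).
    The group statistics stand in for a learned value baseline.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average receive a positive advantage and the rest a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.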

Developer Impact

SignSGD trades momentum for efficiency

The proposal cites SignSGD's speed and low memory use in settings where gradient updates are already noisy, but it points to no RL benchmarks, so its practical edge remains an informed suggestion rather than a proven result.
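For readers unfamiliar with the update rule, the following is a minimal sketch of SignSGD without momentum, written in plain Python for illustration. It is not code from the thread. The function names and the learning rate are assumptions. The point it shows is that the step uses only the sign of each gradient entry and keeps no per-parameter optimizer state, whereas momentum SGD stores one extra buffer per parameter and Adam stores two.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grads, lr=0.01):
    """One SignSGD update: w <- w - lr * sign(g).

    No momentum buffer or second-moment estimate is kept between steps,
    so the optimizer itself needs no memory beyond the parameters.
    """
    if len(params) != len(grads):
        raise ValueError("params and grads must have the same length")
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# Hypothetical parameters and a noisy gradient estimate.
params = [1.0, -2.0, 0.5]
grads = [0.3, -4.0, 0.0]
updated = signsgd_step(params, grads, lr=0.1)
print(updated)
```

Every nonzero gradient entry moves its parameter by exactly `lr`, regardless of magnitude, which is one reason the method is sometimes argued to tolerate noisy gradient estimates. Whether that holds for RL fine-tuning at scale is exactly the open question raised in the thread.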

Sentiment

Most commenters encouraged more discussion of GRPO and RL readiness after the Muon and Shampoo buzz, saying they found the arguments and experiments engaging, while one worried about added training costs.

Positive: 85.7%
Negative: 14.3%
Based on 7 comments with sentiment.

Cluster Engagement

Posts from X

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

rohan anil@_arohan_

@ziv_ravid Variance reduction argument is main one I buy. From feature learning perspective, may not so much

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ What do you think about on policy self-distillation? 🧐

Sagnik@saagnikkk

@agarwl_ Actually we found that vanilla SGD (and later signSGD w/o momentum) is good enough for RL.

Evidence for SGD:

rohan anil@_arohan_

@jachiam0 We must teach these models that. They seem to have catastrophic forgetting

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ What do you think about on policy self-distillation? 🧐

rohan anil@_arohan_

@agarwl_ AgarwalModdedRL when!?

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ Do you think the on policy is an important part?

rohan anil@_arohan_

@ziv_ravid Variance reduction argument is main one I buy. From feature learning perspective, may not so much

samsja@samsja19

@agarwl_ afaik glm5 was trained like this

Rishabh Agarwal@agarwl_

It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways

stochasm@stochasticchasm

@_arohan_ @HessianFree 🍿

elie@eliebakouch

@jachiam0 @_arohan_ BREAKING NEWS: OPENAI EMPLOYEE LEAK ALPHA ON X ON HOW THEY DO RL WITHOUT VERIFIABLE REWARDS

Joshua Achiam@jachiam0

@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.

Pierre Bongrand@bongrandp

@_arohan_ The discussion, arguments & runs were super interesting! You should share more what you do

xeon@saymycodename

@_arohan_ would like to see more discussions about it

rohan anil@_arohan_

@ziv_ravid Isn’t it a question of what conditioning you provide, which is the new information that is added?

Ravid Shwartz Ziv@ziv_ravid

@_arohan_ Do you think the on policy is an important part?

Kratius@Kratius1

@_arohan_ Yes sir .... Drop it..

Aakash Kumar Nain@A_K_Nain

@_arohan_ Hahaha I have things to say that I haven't said publicly for a reason 😂😂

Ahmed Ahmed@AhmedSQRD

@_arohan_ please keep the hot takes coming 🙏🏾

Mohammed Alshehri@SwishMoe

@_arohan_ Born ready Release the environment!!!!

Edward Milsom@edward_milsom

@_arohan_ It feels like you want to be more conservative with the preconditioner in these noisy settings. Related: we recently wrote up some ideas on how Muon might be adapted to noisy settings:

Sahand Sharifzadeh@sahandsharif

@_arohan_ Let's do it
