Good morning!
Looks like timeline is still talking about training algorithms for neural networks, particularly muon and shampoo.
Are we mentally prepared to talk about GRPO and RL?
CoreAutoAI co-founder Rohan Anil is stirring conversation by questioning whether the optimizer chatter around Muon and Shampoo is ready to move toward GRPO and broader reinforcement learning techniques, while Rishabh Agarwal counters with a simpler alternative that trades momentum for lower memory demands in noisy RL settings.
Participants highlight GRPO's group-based advantage estimation that skips a separate critic model, yet no timelines or scale of any community pivot are confirmed, leaving the actual shift speculative.
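For readers who have not seen group-based advantage estimation, a minimal sketch follows. It assumes the commonly described form in which several completions are sampled for the same prompt and each reward is normalized against its own group's mean and standard deviation, so no critic is needed. The function name, the use of population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the thread.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled completion relative to its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. The group mean serves as the baseline, so no learned critic
    (value model) is involved. `eps` is an assumed constant that avoids
    division by zero when every reward in the group is identical.
    """
    if not rewards:
        return []
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical group of four completions with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion earned the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this toy group the two rewarded completions get positive advantages and the other two get negative ones, while the all-equal group yields zeros, which is why such groups contribute no gradient.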
The proposal notes SignSGD's compression benefits in noisy environments without specifying current RL benchmarks, so its practical edge stays at the level of informed suggestion rather than proven result.
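To make the SignSGD suggestion concrete, here is a minimal sketch of the update without momentum: each parameter moves by a fixed step in the direction opposite the sign of its gradient, and no optimizer state is stored between steps. The toy quadratic objective, the Gaussian noise level, the learning rate, and all names below are assumptions chosen for illustration; this is not a benchmark and says nothing about real RL workloads.

```python
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grads, lr=0.01):
    """One SignSGD step: only the sign of each gradient entry is used.

    No momentum buffer or second-moment estimate is kept, so the
    optimizer itself needs no extra memory beyond the parameters.
    """
    return [p - lr * sign(g) for p, g in zip(params, grads)]


def noisy_grad(params, rng, noise_std=0.5):
    """Gradient of f(w) = sum(w_i^2) plus Gaussian noise (toy assumption)."""
    return [2.0 * p + rng.gauss(0.0, noise_std) for p in params]


rng = random.Random(0)  # seeded so the run is repeatable
params = [1.0, -2.0]
for _ in range(500):
    params = signsgd_step(params, noisy_grad(params, rng))
```

On this toy problem the parameters walk toward zero and then jitter around it with a step size set by `lr`, which illustrates the trade-off: the update is cheap and stateless, but the magnitude information in the gradient is discarded.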
Many users encouraged more discussion of GRPO and RL readiness after the Muon and Shampoo buzz because they found the arguments and experiments engaging, while one worried about added training costs.
It would be funny if it turns out SignSGD (no momentum, no second order stuff) is good enough for RL (because fast and low memory utilization) and gradient updates are noisy anyways
@_arohan_ A modest proposal: in order to determine credit assignment in RL, we should simply derive an advantage function for algorithm and implementation diffs.
@ziv_ravid Variance reduction argument is the main one I buy. From a feature learning perspective, maybe not so much
@_arohan_ What do you think about on policy self-distillation? 🧐

@agarwl_ Actually we found that vanilla SGD (and later signSGD w/o momentum) is good enough for RL.
Evidence for SGD:
@jachiam0 We must teach these models that. They seem to have catastrophic forgetting
@agarwl_ AgarwalModdedRL when!?
@_arohan_ Do you think the on policy is an important part?
@agarwl_ afaik glm5 was trained like this

@_arohan_ @HessianFree 🍿
@jachiam0 @_arohan_ BREAKING NEWS: OPENAI EMPLOYEE LEAK ALPHA ON X ON HOW THEY DO RL WITHOUT VERIFIABLE REWARDS

@_arohan_ The discussion, arguments & runs were super interesting! You should share more what you do

@_arohan_ would like to see more discussions about it
@ziv_ravid Isn’t it a question of what conditioning you provide, which is the new information that is added?

@_arohan_ Yes sir .... Drop it..

@_arohan_ Hahaha I have things to say that I haven't said publicly for a reason 😂😂

@_arohan_ please keep the hot takes coming 🙏🏾

@_arohan_ Born ready Release the environment!!!!

@_arohan_ It feels like you want to be more conservative with the preconditioner in these noisy settings. Related: we recently wrote up some ideas on how Muon might be adapted to noisy settings:

@_arohan_ Let's do it