@jsuarez @ChenTessler Are there any blog posts or papers on that? I'd be very interested in cases where Muon helps in RL; my impression so far has been that optimizers matter more in pre-training.
@ChenTessler It was a major component of Puffer 3. We reswept hypers for ~10 tasks with Muon vs Adam and the gap was quite clear. Try the PufferNet arch over LSTM next!