@kellerjordan0 You get a different horizon with learning rate decay?
@_arohan_ This is an interesting result. I'm just saying compare against the real baseline please! There's no rerunning necessary.
@kellerjordan0 I ran these with 3375 end steps, because that's what I saw in the logs I was using. I am okay with this.
@_arohan_ Sorry wdym? The main difference is a better-tuned learning rate compared to the old baseline.
Both your logfile above and the proper Muon baseline would qualify as 3325-step runs, since your logfile hits <3.28 at that time. Tho yours would need more stat sig, but that's trivial.

@_arohan_ Ok, well your comparison is between an undertuned Muon and a fully-tuned Shampoo.
I guess you used result #6 logs instead of #12? Not a massive deal, but I'll make the fair comparison when I do a post
