6h ago

Tim Tsz-Kit Lau and Weijie Su propose a symmetry-compatible principle for optimizer design that respects permutation invariance across embeddings, LM heads, SwiGLU MLPs, and MoE routers

The May 19, 2026 paper ships with an equivariant-optimizers GitHub repository.


Original post

Really cool paper, and it actually made me realize something. Addressing the permutation-invariance symmetry seems properly useful, and I'd imagine it adds even more complexity for attention specifically, even in the early phases of training.

While MLPs have per-neuron permutation invariance, attention heads are a bit more complex:

- Permuting features within a given head gives an equivalent network
- BUT the same permutation must be applied to Q, K, and V
- Permuting the heads themselves is then fine, similar to observations around MoE experts

One thing I'd want to understand better: I imagine permuting dimensions pretty easily gives equivalent descent dynamics? Arbitrary rotations of the basis, though, might have tangible effects on how optimizers like Adam, which track elementwise dynamics, behave.
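The Adam question in the post can be poked at numerically. Below is a minimal sketch, not taken from the paper or its repository: a textbook elementwise Adam step written in plain Python, applied to a toy gradient. The helper names (`adam_step`, `first_update`, `permute`, `rotate2d`, `max_abs_diff`) and the toy values are assumptions made for illustration. It compares "transform the gradient, then update" against "update, then transform" for a coordinate permutation and for a 45-degree rotation, with plain SGD as a contrast.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One textbook Adam step on flat lists; every coordinate is handled independently."""
    new_p, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g           # first-moment estimate
        vi = b2 * vi + (1 - b2) * g * g       # second-moment estimate (elementwise square)
        m_hat = mi / (1 - b1 ** t)            # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v

def first_update(grads, lr=1e-3):
    """Parameter change produced by the first Adam step from zero params and zero state."""
    zeros = [0.0] * len(grads)
    new_p, _, _ = adam_step(zeros, grads, zeros, zeros, t=1, lr=lr)
    return new_p

def sgd_update(grads, lr=1e-3):
    """Plain gradient-descent update, used here only as a contrast."""
    return [-lr * g for g in grads]

def permute(vec, perm):
    """Reorder coordinates: output[i] = vec[perm[i]]."""
    return [vec[i] for i in perm]

def rotate2d(vec, theta):
    """Rotate a 2-vector counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Permutation: permuting the gradient first, or permuting the update afterwards.
g3 = [3.0, -0.5, 0.01]
perm = [2, 0, 1]
perm_gap = max_abs_diff(first_update(permute(g3, perm)),
                        permute(first_update(g3), perm))

# Rotation: the same comparison with a 45-degree change of basis.
g2 = [1.0, 0.0]
theta = math.pi / 4
adam_rot_gap = max_abs_diff(first_update(rotate2d(g2, theta)),
                            rotate2d(first_update(g2), theta))
sgd_rot_gap = max_abs_diff(sgd_update(rotate2d(g2, theta)),
                           rotate2d(sgd_update(g2), theta))

print("Adam, permutation gap:", perm_gap)
print("Adam, rotation gap:   ", adam_rot_gap)
print("SGD,  rotation gap:   ", sgd_rot_gap)
```

In this toy setting the permutation gap is zero because Adam treats every coordinate with the same scalar rule, while the rotation gap is not: the first Adam step is roughly `-lr * sign(g)`, and taking a sign per coordinate depends on the basis. SGD's update is linear in the gradient, so it commutes with the rotation up to rounding. This only illustrates the intuition in the post; it says nothing about how the paper's optimizers are built.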

12:38 PM · May 20, 2026

Reposted by

#1502@WEIJIE444

