Really cool paper and it actually made me realize something.
Addressing the permutation-invariance symmetry seems properly useful, and it feels like this symmetry drives even more complexity for attention specifically, I'd imagine even in the early phases of training.
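While MLPs have per-neuron permutation invariance, attention heads are a bit more complex in that:
- Permuting features within a given head gives an equivalent network
- BUT the permutation has to be applied consistently: the same one on Q and K, and the same one on V and the matching rows of the output projection (using a single permutation for all of them is the simplest case)
- Permuting the heads themselves is also fine as long as the output projection blocks move with them, similar to observations around MoE experts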
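To make the head symmetries above concrete, here's a rough NumPy sketch (not from the paper). It assumes a plain multi-head attention layer with no biases, masking, or positional tricks; the names `mha`, `Wq`, `Wk`, `Wv`, `Wo` and the shapes are my own choices for illustration. It checks that the output is unchanged when Q and K share one within-head permutation and V shares one with the output projection, that reordering whole heads is also fine, and that permuting Q alone breaks equivalence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, Wq, Wk, Wv, Wo):
    # x: (seq, d_model); Wq/Wk/Wv: (heads, d_model, d_head); Wo: (heads, d_head, d_model)
    q = np.einsum("sd,hde->hse", x, Wq)
    k = np.einsum("sd,hde->hse", x, Wk)
    v = np.einsum("sd,hde->hse", x, Wv)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(q.shape[-1])
    out = np.einsum("hst,hte->hse", softmax(scores), v)
    # Sum each head's contribution through its block of the output projection.
    return np.einsum("hse,hed->sd", out, Wo)

rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 5, 16, 4, 8
scale = d_model ** -0.5  # keep attention scores at a moderate magnitude
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (scale * rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
Wo = scale * rng.normal(size=(heads, d_head, d_model))
base = mha(x, Wq, Wk, Wv, Wo)

# Fixed, non-identity permutations (illustrative choices).
p_qk = np.roll(np.arange(d_head), 1)   # shared by Q and K
p_vo = np.arange(d_head)[::-1]         # shared by V and the rows of Wo
p_h = np.roll(np.arange(heads), 1)     # reorders whole heads

within = mha(x, Wq[:, :, p_qk], Wk[:, :, p_qk], Wv[:, :, p_vo], Wo[:, p_vo, :])
across = mha(x, Wq[p_h], Wk[p_h], Wv[p_h], Wo[p_h])
broken = mha(x, Wq[:, :, p_qk], Wk, Wv, Wo)  # Q permuted, K left alone

same_within = np.allclose(base, within)
same_across = np.allclose(base, across)
same_broken = np.allclose(base, broken)
print(same_within, same_across, same_broken)
```

Using one permutation for Q, K, V and the output projection together is just the special case `p_qk == p_vo`.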
One thing I’d want to understand better: I imagine a permutation of dimensions is pretty easily equivalent under gradient descent? Arbitrary rotations of the basis, though, might have tangible effects on optimizers like Adam, which track elementwise statistics.
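A quick way to poke at this question is a toy experiment, sketched below under my own assumptions: a 3-dimensional quadratic loss, a textbook Adam update with bias correction, and hand-picked matrices `P` (a permutation) and `R` (a rotation). None of this comes from the paper. The idea is to run the optimizer in a transformed basis, map the result back, and compare with running it in the original basis.

```python
import numpy as np

def run_gd(grad_fn, theta0, steps=10, lr=0.05):
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

def run_adam(grad_fn, theta0, steps=10, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    theta = theta0.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2          # elementwise second moment
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = 0.5 * theta^T A theta (an assumption for illustration).
A = np.diag([1.0, 4.0, 9.0])
theta0 = np.array([1.0, -2.0, 0.5])

def grad(theta):
    return A @ theta

def in_basis(T):
    # Gradient of the same loss expressed in coordinates theta' = T @ theta (T orthogonal).
    return lambda th: T @ grad(T.T @ th)

def matches(run, T):
    # Optimize in the transformed basis, map back, compare with the original run.
    transformed = T.T @ run(in_basis(T), T @ theta0)
    return np.allclose(transformed, run(grad, theta0))

P = np.eye(3)[[2, 0, 1]]                      # permutation matrix
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation

gd_perm_ok, gd_rot_ok = matches(run_gd, P), matches(run_gd, R)
adam_perm_ok, adam_rot_ok = matches(run_adam, P), matches(run_adam, R)
print(gd_perm_ok, gd_rot_ok, adam_perm_ok, adam_rot_ok)
```

In this toy setup plain gradient descent gives the same trajectory under both the permutation and the rotation, while Adam only does so under the permutation, since its per-coordinate normalization depends on which basis the gradient is written in. Whether that matters in practice for real attention layers is a separate question this sketch doesn't answer.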