Tilde releases Compositional Muon, stabilizing transformer optimization by applying partner-whitened updates to composed QK and OV matrices

It shares core mechanics with the prior LoRA-RITE algorithm.

Original post · Aryaman Arora #669
Tilde@tilderesearch

Introducing Compositional Muon, an optimizer that extends Muon from individual matrices to composed transformer circuits.

Modern optimizers usually draw trust regions around individual parameters. But in attention, the loss often sees compositions like QK^T and OV. Updating each factor independently can therefore control the wrong object. Compositional Muon closes this gap by deriving partner-whitened update rules. Each factor’s update is shaped by the spectral geometry of the matrix it is composed with, producing more stable composed updates and better effective learning-rate allocation across heads and layers.
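
To make the "wrong object" point concrete, here is a minimal NumPy sketch, not the rule from the post. It assumes one plausible reading of "partner-whitened": right-multiplying an orthogonalized step for W_Q by (W_K^T W_K)^{-1/2}. The helper names (`orthogonalize`, `inv_sqrt_gram`, `composed_step_norms`) and the random stand-in gradient are hypothetical. It shows that a factor-local step changes QK^T by an amount that grows with the partner's scale, while the whitened step keeps the composed change fixed.

```python
import numpy as np

def orthogonalize(G):
    """Polar factor of G, computed exactly by SVD.
    (Muon approximates this with Newton-Schulz iterations.)"""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def inv_sqrt_gram(K):
    """(K^T K)^{-1/2} via eigendecomposition; assumes K has full column rank."""
    evals, evecs = np.linalg.eigh(K.T @ K)
    return (evecs / np.sqrt(evals)) @ evecs.T

rng = np.random.default_rng(0)
d_model, d_head, lr = 64, 16, 0.1
G_q = rng.standard_normal((d_model, d_head))  # stand-in gradient w.r.t. W_Q
W_k = rng.standard_normal((d_model, d_head))  # partner factor W_K

def composed_step_norms(scale):
    """Spectral norm of the first-order change in W_Q W_K^T for both step types."""
    K = scale * W_k
    dQ_plain = lr * orthogonalize(G_q)        # factor-local step, ||dQ||_2 = lr
    dQ_white = dQ_plain @ inv_sqrt_gram(K)    # assumed partner-whitened step
    plain = np.linalg.norm(dQ_plain @ K.T, 2)
    white = np.linalg.norm(dQ_white @ K.T, 2)
    return plain, white

for scale in (1.0, 10.0):
    plain, white = composed_step_norms(scale)
    print(f"partner scale {scale:4.1f}: plain {plain:.3f}, whitened {white:.3f}")
```

In this sketch the whitened step always moves the composed matrix by exactly `lr` in spectral norm, because (K^T K)^{-1/2} K^T has orthonormal rows.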

For QK, this gives a head-local half-split rule. For OV, the circuit geometry selects a hybrid rule: V is optimized per-head, while W_O is optimized as the single matrix that aggregates all heads back into the residual stream.
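
The per-head versus aggregated distinction can be seen in the shapes alone. The sketch below is only bookkeeping, not the update rule: it assumes stacked per-head value maps `W_v` and a single output projection `W_o` (hypothetical names and sizes), and sets attention patterns aside. Each head owns a block of `W_v`, while `W_o` is one matrix whose product with `W_v` sums every head's contribution into the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 4, 8, 32

W_v = rng.standard_normal((n_heads * d_head, d_model))  # stacked per-head value maps
W_o = rng.standard_normal((d_model, n_heads * d_head))  # single output projection

def head_slice(h):
    """Index range of head h inside the stacked head dimension."""
    return slice(h * d_head, (h + 1) * d_head)

# Per-head OV circuits: each maps the residual stream to itself with rank <= d_head.
ov_per_head = [W_o[:, head_slice(h)] @ W_v[head_slice(h), :] for h in range(n_heads)]

# With attention patterns set aside, the heads' contributions add up to one product.
ov_total = W_o @ W_v
print(np.allclose(sum(ov_per_head), ov_total))
```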

CM improves over Muon at 340M and 1B scale, transfers to the modded-nanoGPT optimization benchmark, and can be approximated cheaply as partner-rescaled Muon via the isotropic rule.
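
One way to read "partner-rescaled" is sketched below, as an assumption rather than the post's definition. If the partner's Gram matrix is treated as isotropic, K^T K ≈ s^2 I, then whitening by (K^T K)^{-1/2} collapses to dividing by the scalar s. The choice of s as the RMS singular value and all function names are hypothetical.

```python
import numpy as np

def inv_sqrt_gram(K):
    """(K^T K)^{-1/2} via eigendecomposition; assumes K has full column rank."""
    evals, evecs = np.linalg.eigh(K.T @ K)
    return (evecs / np.sqrt(evals)) @ evecs.T

def isotropic_scale(K):
    """Scalar s with K^T K ~= s^2 I, taken here as the RMS singular value of K."""
    return np.linalg.norm(K) / np.sqrt(K.shape[1])

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# An exactly isotropic partner: orthonormal columns times a scalar.
basis, _ = np.linalg.qr(rng.standard_normal((d_model, d_head)))
K_iso = 3.0 * basis
# A generic partner whose singular values are spread out.
K_gen = rng.standard_normal((d_model, d_head))

step = rng.standard_normal((d_model, d_head))  # stand-in for an orthogonalized update

def relative_gap(K):
    """Relative difference between full whitening and the scalar rescale."""
    full = step @ inv_sqrt_gram(K)
    cheap = step / isotropic_scale(K)
    return np.linalg.norm(full - cheap) / np.linalg.norm(full)

print(f"isotropic partner gap: {relative_gap(K_iso):.2e}")
print(f"generic partner gap:   {relative_gap(K_gen):.2e}")
```

The scalar version needs only a norm of the partner instead of a matrix inverse square root, which is the sense in which such an approximation would be cheap. It is exact only when the partner really is isotropic.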

The broader point is optimizer-architecture co-design: better optimizers should not only ask how to update a parameter, but what composed circuit that parameter participates in. CM is one step toward optimizers that respect the functional structure the loss actually sees.

9:11 AM · Jun 5, 2026 · 52.6K Views
Sentiment

Users are excited about Tilde Research's Compositional Muon optimizer for transformers: commenters call it cool, promising work that could help with LoRA optimization and related techniques.

Positive: 100.0% · Negative: 0.0% (5 comments with sentiment)
Most active post (views 14.6K, bookmarks 46, likes 52, replies 3):

Just read this, nice research. We did something similar long back on LoRA factors.

https://arxiv.org/pdf/2410.20625

9h · Views 14.6K · Likes 52 · Bookmarks 46
Retweets: 31
Tilde@tilderesearch (original post, shown in full above)

20h · Views 52.6K · Likes 296 · Bookmarks 223
rohan anil@_arohan_

Nobody out preconditions Sai @dvsaisurya

7h · Views 7K · Likes 44 · Bookmarks 24
Tilde@tilderesearch

Read the full post here: https://blog.tilderesearch.com/blog/compositional-muon

20h · Views 1.1K · Likes 16 · Bookmarks 3
nor@norxornor

cool work! i was recently working on understanding the behavior of deep linear subnetwork reparameterization of linear layers - typical optimization with them (e.g. with weight decay) leads to something resembling a nuclear norm penalty on the weights (the visible spectral gap can be seen as an artifact of that) and also makes updates slower than they can be (bounds using single layer worst case twice are weaker than actual steepest descent bounds), and keeping in mind the factorized structure can help improve both, as in this approach

19h · Views 578 · Likes 9
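
The nuclear-norm remark above rests on a standard identity: over all factorizations W = AB, the minimum of (||A||_F^2 + ||B||_F^2)/2 equals the nuclear norm of W, reached by the balanced factors built from the SVD. A small NumPy check follows; the matrix sizes and the `penalty` helper are illustrative choices, not taken from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-5 target matrix of shape 12 x 10.
W = rng.standard_normal((12, 5)) @ rng.standard_normal((5, 10))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
nuclear_norm = s.sum()

def penalty(A, B):
    """Weight-decay-style penalty on the factors: half the sum of squared Frobenius norms."""
    return 0.5 * (np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2)

# Balanced factorization: split each singular value evenly between the factors.
A_bal = U * np.sqrt(s)
B_bal = np.sqrt(s)[:, None] * Vt

# Unbalanced factorization of the same W: push scale from one factor to the other.
c = 3.0
A_unb, B_unb = c * A_bal, B_bal / c

print(f"nuclear norm:       {nuclear_norm:.4f}")
print(f"balanced penalty:   {penalty(A_bal, B_bal):.4f}")
print(f"unbalanced penalty: {penalty(A_unb, B_unb):.4f}")
```

Both factor pairs reproduce the same W, but only the balanced pair attains the nuclear norm, which is why decaying the factors separately acts like a low-rank penalty on the product.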
ueaj@_ueaj

@dvsaisurya oh perfect we were looking for a lora optimizer actually cc @afrenkai

8h · Views 79 · Likes 3
tom@thomasp_19

@tilderesearch tilde locked in

18h · Views 221 · Likes 1
Oleg kAI@oleg_kai

@tilderesearch buried beat is composition-aware optimization is the missing layer between adam independence and full joint updates. current tensor cores resist it. software-hardware co-design becomes the training-cost bottleneck before architecture does.

16h · Views 229
Guilherme O'Tina@guilhermeotina

the tension is real: adam updates Q and K independently but the loss only sees QK^T. gradient components that cancel in the product are invisible to the loss yet consume update budget. the practical question is whether jacobian recomposition per circuit block buys enough to justify the extra compute vs just tying lr ratios between paired matrices

14h · Views 131
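
The "components that cancel in the product" can be written down explicitly. For any invertible head-space matrix A, the pair (QA, KA^{-T}) gives the same QK^T, and the matching infinitesimal directions dQ = QG, dK = -KG^T leave the product unchanged to first order. The NumPy sketch below checks both statements; the sizes and the random generator `G` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 32, 8
Q = rng.standard_normal((d_model, d_head))
K = rng.standard_normal((d_model, d_head))
G = rng.standard_normal((d_head, d_head))  # arbitrary generator of a head-space change of basis

# Finite version: (Q A)(K A^{-T})^T = Q A A^{-1} K^T = Q K^T.
A = np.eye(d_head) + 0.1 * G
Q2, K2 = Q @ A, K @ np.linalg.inv(A).T
print(np.allclose(Q2 @ K2.T, Q @ K.T))

# Infinitesimal version: these factor updates are nonzero,
# yet their first-order effect on the product cancels.
dQ, dK = Q @ G, -K @ G.T
first_order = dQ @ K.T + Q @ dK.T
print(np.linalg.norm(dQ), np.linalg.norm(dK), np.abs(first_order).max())
```

Any part of a per-factor update that lies along such directions moves the parameters without moving the composed matrix the loss depends on.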
Xidulu@xidulu

@tilderesearch This reminds me of Quack from @laurence_ai and @benaibean : https://arxiv.org/abs/2511.21377

8h · Views 16
Ben@SolidlySheafy

@dvsaisurya This paper looks great, thanks for sharing

6h · Views 7