Raw STE Enables Stable VQVAE Codebook Training With MoE Top-1 Routers

Original post

raw STE still works unreasonably well for what it is as a biased gradient. you can train vqvae style codebooks with a MoE style topk=1 router instead of the hacky non-differentiable nearest neighbor stuff. it's much better behaved (right is a flat 65k vocab + 256 toks, mse recon)

12:36 PM · Jun 5, 2026 · 6.4K Views

Sentiment

The one reply with detected sentiment praises STE for delivering clean results.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 997 · Bookmarks: 3 · Likes: 10 · Retweets: 1 · Replies: 1

kalomaze (@kalomaze)

this is fully end to end. no commitment loss term to avoid collapse, no "code gets fake ema update based on the average of the encoder outputs that nearest neighbor round to it", purely softmax + STE can learn a categorical where discrete embeddings map to smooth 64dim latents
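A minimal sketch of what "softmax + STE" with a top-1 router could look like for a single token, with the backward pass written out by hand so the bias is visible. This is an illustration under stated assumptions, not the author's code: the function names (`route_forward`, `route_backward`), the tiny 3-entry codebook, and the 2-dim latents are all hypothetical. The forward pass picks exactly one codebook row; the backward pass pretends the hard one-hot choice was the softmax output, so every logit receives a gradient even though only one code was used.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_forward(logits, codebook):
    """Top-1 routing: the output is exactly one codebook row (hard choice)."""
    probs = softmax(logits)
    k = max(range(len(probs)), key=lambda i: probs[i])
    return probs, k, list(codebook[k])

def route_backward(probs, k, codebook, grad_latent):
    """Raw straight-through estimator (assumed form).

    Forward used one_hot(k); backward treats d(one_hot)/d(probs) as identity.
    In an autodiff framework this is commonly written as
    hard + soft - stop_gradient(soft).
    """
    # Only the selected code row was used, so only it gets a codebook gradient.
    grad_codebook = [[0.0] * len(grad_latent) for _ in codebook]
    grad_codebook[k] = list(grad_latent)
    # latent = sum_j w_j * codebook[j], so dL/dw_j = <grad_latent, codebook[j]>.
    grad_probs = [sum(g * c for g, c in zip(grad_latent, row)) for row in codebook]
    # Standard softmax backward: p_i * (g_i - sum_j p_j g_j).
    dot = sum(p * g for p, g in zip(probs, grad_probs))
    grad_logits = [p * (g - dot) for p, g in zip(probs, grad_probs)]
    return grad_logits, grad_codebook

# Hypothetical toy values: 3 codes, 2-dim latents.
logits = [0.0, 2.0, 1.0]
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs, k, latent = route_forward(logits, codebook)
grad_logits, grad_codebook = route_backward(probs, k, codebook, [1.0, -1.0])
```

Note that nothing here adds a commitment loss or an EMA codebook update; in this sketch the router logits and the selected code are driven by the downstream loss gradient alone, which is the property the post describes.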

kalomaze (@kalomaze)

in principle the decoder latents can be any dimensionality and feed back into whatever decoding structure you want, i.e conditional diffusion/flow matching autoencoder that takes the routed 64dim code latents -> denoises at any noise level...
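As a rough illustration of the "denoises at any noise level" idea, the sketch below builds one training example for a flow-matching decoder conditioned on the routed code latent. The post does not say which probability path or parameterization is used; the linear interpolation between Gaussian noise and data (with velocity target `data - noise`) is an assumption here, and `flow_matching_pair` is a hypothetical helper name.

```python
import random

def flow_matching_pair(target, code_latent, rng):
    """Build one conditional flow-matching example (assumed linear path).

    target:      clean data vector the decoder should reconstruct
    code_latent: routed code latent, passed through as conditioning
    Returns ((x_t, t, code_latent), velocity_target).
    """
    t = rng.random()  # noise level sampled uniformly in [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    # Point on the straight line from noise (t=0) to data (t=1).
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    # Along a straight line the velocity is constant: data - noise.
    velocity = [x - n for n, x in zip(noise, target)]
    return (x_t, t, code_latent), velocity

rng = random.Random(0)
(x_t, t, cond), velocity = flow_matching_pair([1.0, 2.0], [0.5] * 4, rng)
```

A decoder trained this way would take `(x_t, t, cond)` as input and regress `velocity`; because `t` is sampled fresh for each example, the same network sees every noise level.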

Paco (@Pacoxbt)

@kalomaze STE keeps delivering clean results
