
Raw STE Enables Stable VQVAE Codebook Training With MoE Top-1 Routers

Original post
kalomaze (@kalomaze)

raw STE still works unreasonably well for what it is as a biased gradient. you can train vqvae style codebooks with a MoE style topk=1 router instead of the hacky non-differentiable nearest neighbor stuff. it's much better behaved (right is a flat 65k vocab + 256 toks, mse recon)

12:36 PM · Jun 5, 2026 · 6.4K Views
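To make the contrast in the post concrete, here is a minimal NumPy sketch of the two ways a token can pick its code: nearest-neighbor lookup in codebook space (the classic VQ-VAE selection) versus a MoE-style router that scores every code and takes the top-1. The sizes, the random projection `proj`, and the router matrix `router_w` are illustrative assumptions, not details from the post (which mentions a 65k vocab and 256 tokens); only the forward selection is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the post describes a flat 65k vocab with 256 tokens.
num_codes, latent_dim, enc_dim, num_tokens = 128, 64, 32, 256

codebook = rng.normal(size=(num_codes, latent_dim))  # one 64-dim latent per code
enc_out = rng.normal(size=(num_tokens, enc_dim))     # stand-in encoder features, one row per token

# Classic VQ-VAE selection: project into codebook space, pick the closest code by L2 distance.
proj = rng.normal(size=(enc_dim, latent_dim))        # hypothetical projection layer
z = enc_out @ proj
sq_dist = (z ** 2).sum(axis=1, keepdims=True) - 2.0 * z @ codebook.T + (codebook ** 2).sum(axis=1)
nn_idx = sq_dist.argmin(axis=1)

# Router-style selection: a linear layer emits one logit per code, pick the argmax (top-1).
router_w = rng.normal(size=(enc_dim, num_codes)) * 0.1  # hypothetical router weights
logits = enc_out @ router_w
router_idx = logits.argmax(axis=1)

# Either way, the decoder receives one codebook row per token.
nn_latents = codebook[nn_idx]
router_latents = codebook[router_idx]
```

In the router variant the encoder never has to sit close to a codebook vector; the logits are a free classification head over codes, which is where a softmax and a straight-through gradient can be attached.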
Sentiment

Users praise STE for delivering clean results in enabling stable VQVAE codebook training with MoE Top-1 routers.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)
kalomaze (@kalomaze)

this is fully end to end. no commitment loss term to avoid collapse, no "code gets fake ema update based on the average of the encoder outputs that nearest neighbor round to it", purely softmax + STE can learn a categorical where discrete embeddings map to smooth 64dim latents

21h · 1K views · 10 likes · 3 bookmarks
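The "purely softmax + STE" recipe can be sketched with a hand-written backward pass in NumPy. This is one common reading of a straight-through estimator, not necessarily the author's exact code: the forward pass uses the hard one-hot choice, while the backward pass treats the one-hot as if it were the softmax probabilities. The function names, the sizes, and the use of a direct squared error on the latents (standing in for an MSE reconstruction loss through a decoder) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def ste_router_forward(h, router_w, codebook):
    """Hard top-1 routing: each token receives exactly one codebook row."""
    probs = softmax(h @ router_w)                    # (tokens, codes)
    idx = probs.argmax(axis=-1)
    onehot = np.eye(codebook.shape[0])[idx]          # dense one-hot, fine for a small sketch
    latents = onehot @ codebook                      # (tokens, latent_dim)
    return latents, (h, probs, onehot)

def ste_router_backward(grad_latents, cache, router_w, codebook):
    """Straight-through backward: treat d(onehot)/d(probs) as the identity."""
    h, probs, onehot = cache
    grad_codebook = onehot.T @ grad_latents          # only selected codes receive gradient
    grad_probs = grad_latents @ codebook.T           # biased: pretends onehot == probs
    # Standard softmax Jacobian-vector product.
    grad_logits = probs * (grad_probs - (grad_probs * probs).sum(axis=-1, keepdims=True))
    grad_router_w = h.T @ grad_logits
    grad_h = grad_logits @ router_w.T                # would flow on into the encoder
    return grad_router_w, grad_codebook, grad_h

# Toy usage with hypothetical sizes; `target` stands in for whatever the decoder loss wants.
rng = np.random.default_rng(0)
num_tokens, enc_dim, num_codes, latent_dim = 8, 16, 32, 64
h = rng.normal(size=(num_tokens, enc_dim))
router_w = rng.normal(size=(enc_dim, num_codes)) * 0.1
codebook = rng.normal(size=(num_codes, latent_dim)) * 0.1
target = rng.normal(size=(num_tokens, latent_dim))

latents, cache = ste_router_forward(h, router_w, codebook)
grad_latents = latents - target                      # gradient of 0.5 * sum of squared errors
grad_router_w, grad_codebook, grad_h = ste_router_backward(grad_latents, cache, router_w, codebook)
```

In an autograd framework the same estimator is usually written as `onehot + probs - stop_gradient(probs)`. Note that nothing in this sketch adds a commitment term or an EMA codebook update; the codebook and router are both updated from the single task gradient, which matches the setup the post describes.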
kalomaze (@kalomaze)

in principle the decoder latents can be any dimensionality and feed back into whatever decoding structure you want, i.e conditional diffusion/flow matching autoencoder that takes the routed 64dim code latents -> denoises at any noise level...

21h · 775 views · 7 likes · 1 bookmark
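As a rough illustration of the conditional flow-matching idea mentioned above, the sketch below builds one training batch for a velocity model that is conditioned on routed code latents. It assumes the linear interpolation path commonly used in flow matching (noise at t = 0, data at t = 1); the post does not specify a path or a model, so the sizes, the random stand-in tensors, and the linear `w` "model" are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, code_dim, data_dim = 256, 64, 48          # illustrative sizes

code_latents = rng.normal(size=(num_tokens, code_dim))  # stand-in for routed 64-dim code latents
x1 = rng.normal(size=(num_tokens, data_dim))            # stand-in for clean data (e.g. patches)
x0 = rng.normal(size=x1.shape)                          # Gaussian noise sample
t = rng.uniform(size=(num_tokens, 1))                   # one noise level per token, in [0, 1)

# Assumed linear path between noise and data, with its constant velocity as the target.
x_t = (1.0 - t) * x0 + t * x1
target_velocity = x1 - x0

# Hypothetical linear velocity model that sees the noisy input, the noise level, and the code latents.
features = np.concatenate([x_t, t, code_latents], axis=1)
w = rng.normal(size=(features.shape[1], data_dim)) * 0.01
pred_velocity = features @ w

loss = np.mean((pred_velocity - target_velocity) ** 2)  # regression loss on the velocity
```

Because `t` is sampled per token, the same conditioned model is trained across all noise levels, which is the "denoises at any noise level" property the post points to. In a full system the gradient of this loss with respect to `code_latents` would flow back into the codebook and router.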
Paco (@Pacoxbt)

@kalomaze STE keeps delivering clean results

21h · 8 views