raw STE (straight-through estimator) still works unreasonably well considering it's a biased gradient estimator. you can train vqvae-style codebooks with a MoE-style topk=1 router instead of the hacky non-differentiable nearest-neighbor stuff. it's much better behaved (right is a flat 65k vocab + 256 toks, mse recon)
this is fully end to end. no commitment loss term to avoid collapse, no "code gets a fake ema update based on the average of the encoder outputs that nearest-neighbor round to it". pure softmax + STE can learn a categorical where discrete embeddings map to smooth 64-dim latents
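
to spell out the mechanics, here's a minimal sketch of what "softmax + STE" with a top-1 router could look like. it is not the actual training code from this thread: the names `route_forward` and `route_backward`, the toy 3-code / 2-dim codebook, and the hand-written backward pass in plain Python are all assumptions made for illustration. in an autodiff framework the same idea is commonly written as `hard + soft - stop_gradient(soft)`.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_forward(logits, codebook):
    """Top-1 routing: the forward pass emits exactly one codebook row."""
    probs = softmax(logits)
    k = max(range(len(probs)), key=lambda i: probs[i])  # hard argmax
    return list(codebook[k]), probs, k

def route_backward(grad_out, probs, k, codebook):
    """Straight-through backward pass (hypothetical helper).

    The forward pass used a hard one-hot, but the gradient w.r.t. the
    logits is computed as if the routing weights had been the softmax.
    """
    # gradient w.r.t. each routing weight: grad_out . codebook[i]
    g_w = [sum(g * c for g, c in zip(grad_out, row)) for row in codebook]
    # softmax vector-Jacobian product: p_i * (g_i - sum_j p_j * g_j)
    dot = sum(p * g for p, g in zip(probs, g_w))
    grad_logits = [p * (g - dot) for p, g in zip(probs, g_w)]
    # the codebook was weighted by the hard one-hot, so only row k gets gradient
    grad_codebook = [[0.0] * len(grad_out) for _ in codebook]
    grad_codebook[k] = list(grad_out)
    return grad_logits, grad_codebook

# toy setup (assumed values): 3 codes, 2-dim latents, MSE against a target
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
logits = [0.1, 2.0, -1.0]
target = [1.0, 0.0]

out, probs, k = route_forward(logits, codebook)
grad_out = [2.0 * (o - t) / len(out) for o, t in zip(out, target)]  # d(MSE)/d(out)
grad_logits, grad_codebook = route_backward(grad_out, probs, k, codebook)

print("selected code:", k)
print("grad wrt logits:", grad_logits)
```

in this toy case the router picks code 1, but the logit for code 0 (the row that matches the target) still receives a negative gradient, so a descent step pushes it up even though the forward pass never used it. that's the biased-but-useful signal the straight-through trick provides.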
in principle the decoder latents can be any dimensionality and feed back into whatever decoding structure you want, e.g. a conditional diffusion/flow-matching autoencoder that takes the routed 64-dim code latents -> denoises at any noise level...

@kalomaze STE keeps delivering clean results