raw STE still works unreasonably well for what it is: a biased gradient estimator. you can train vqvae-style codebooks with a MoE-style topk=1 router instead of the hacky non-differentiable nearest-neighbor lookup. it's much better behaved (right is a flat 65k vocab + 256 toks, mse recon)
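
a rough sketch of one way to read this, not the exact setup from the post: a linear router scores every code, the forward pass takes the hard top-1 code, and the backward pass pretends the one-hot selection was the softmax probabilities (the raw STE part). everything here is an assumption made for illustration: the names `route_top1` and `ste_router_grad`, the linear router, the softmax, and the tiny sizes (nowhere near a 65k vocab). it's plain Python with the gradient written by hand so it runs without any framework.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(z, router_w, codebook):
    """Forward: a top-1 router picks the code, no nearest-neighbor search."""
    logits = [sum(w * x for w, x in zip(row, z)) for row in router_w]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)  # hard argmax
    return idx, probs, codebook[idx]

def ste_router_grad(z, probs, codebook, grad_q):
    """Backward under a raw STE: treat the hard one-hot as if it were probs.

    Surrogate: q_soft = sum_k probs[k] * codebook[k]. The softmax Jacobian gives
    dL/dlogit_j = probs[j] * (s_j - sum_k probs[k] * s_k),
    where s_k = dot(grad_q, codebook[k]).
    """
    scores = [sum(g * c for g, c in zip(grad_q, code)) for code in codebook]
    mean_score = sum(p * s for p, s in zip(probs, scores))
    grad_logits = [p * (s - mean_score) for p, s in zip(probs, scores)]
    # logits = router_w @ z, so dL/drouter_w[j][i] = grad_logits[j] * z[i]
    return [[g * x for x in z] for g in grad_logits]

# Hypothetical toy sizes, chosen only so the example runs instantly.
random.seed(0)
dim, num_codes = 4, 8
z = [random.gauss(0, 1) for _ in range(dim)]        # encoder output for one token
target = [random.gauss(0, 1) for _ in range(dim)]   # stand-in reconstruction target
router_w = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]

idx, probs, q = route_top1(z, router_w, codebook)
grad_q = [2 * (a - b) / dim for a, b in zip(q, target)]  # d(MSE)/dq
grad_router_w = ste_router_grad(z, probs, codebook, grad_q)
```

the bias is visible in the code: the forward value is the single hard code `codebook[idx]`, but the router gradient is the one you'd get from the soft mixture. in an autograd framework the same idea is usually written as `hard + probs - stop_gradient(probs)`, so the hand-derived backward isn't needed.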
