did y'all know that next-residual prediction can work even when the residual "codes" are noisy vectors with no inherent per-level structure? 20M transformer, 256 vision patches. (the dotted line marks the "prefill" point, where i start from the first few patches of the recon)
@willccbb @DimitrisPapail a problem imo is people rushing to do 1D categoricals, à la VQVAE. multi-categorical prediction for truly higher-dim data in the same patch, where you model residuals conditioned on coarse->fine, is underexplored. tried it once with random vector "codebooks", and it seemed to work
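for anyone wondering what coarse->fine residual codes over random "codebooks" could look like, here's a minimal numpy sketch. it is not the actual setup from the experiment: the function names (`random_codebooks`, `encode`, `decode`), the sizes (4 levels, 512 codes, 16-dim patches), and the greedy nearest-code rule are all assumptions made for illustration. each patch gets one categorical per level, which is what a model would then predict level by level.

```python
import numpy as np

def random_codebooks(num_levels, codebook_size, dim, seed=0):
    """One i.i.d. Gaussian codebook per level (assumption: untrained, no per-level scaling)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_levels, codebook_size, dim))

def encode(x, codebooks):
    """Greedy coarse->fine residual quantization.

    x: (num_patches, dim) float array.
    Returns integer codes of shape (num_patches, num_levels).
    """
    residual = x.copy()
    codes = np.empty((x.shape[0], codebooks.shape[0]), dtype=np.int64)
    for level, book in enumerate(codebooks):
        # squared distance from every residual to every code: (num_patches, codebook_size)
        d2 = ((residual[:, None, :] - book[None, :, :]) ** 2).sum(-1)
        codes[:, level] = d2.argmin(axis=1)
        # whatever this level did not explain is passed on to the next level
        residual = residual - book[codes[:, level]]
    return codes

def decode(codes, codebooks, num_levels=None):
    """Sum the chosen code vectors over the first `num_levels` levels (default: all)."""
    if num_levels is None:
        num_levels = codes.shape[1]
    recon = np.zeros((codes.shape[0], codebooks.shape[2]))
    for level in range(num_levels):
        recon += codebooks[level][codes[:, level]]
    return recon

# hypothetical data: 256 patches of 16-dim features
patches = np.random.default_rng(1).standard_normal((256, 16))
books = random_codebooks(num_levels=4, codebook_size=512, dim=16)
codes = encode(patches, books)   # (256, 4) categorical targets, coarse->fine
recon = decode(codes, books)     # (256, 16) reconstruction from all levels
```

nothing here guarantees the error shrinks at every level, since the codebooks are pure noise; `decode(codes, books, num_levels=k)` lets you check how the recon changes as levels are added.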