Ethan Smith questions whether looped transformers perform iterative refinement as a kind of descent, or instead split the work into distinct stages such as representation building and classification refinement.
Kalomaze notes LLMs maintain stable conditional distributions via cross-entropy loss.
@torchcompiled diffusion models are sampling from an "implicit" conditional distribution, that is to say: a not quite real one, shaped by assumptions (scheduling, gaussian prior, etc) that sort of guarantee that you aren't learning the true implicit distribution (MSE has no CE-like pressure)
I’ve seen a lot of perspectives of looped transformers as iterative refinement, compared to diffusion. Is there evidence to believe looped transformers are in fact doing some kind of descent of improvement? Or something more complex like, 2nd loop handles representation building and 3rd loop gets closer to classification refinement etc? Skeptical as diffusion from perspectives of ODE and Bayesian theories does feel like improving our guesses of the final endpoint with each pass, to something more and more on manifold. My gut feel is that looped transformers might be taking a more complex path as opposed to the guess/refine loop?
@torchcompiled meanwhile llms: stable exact conditional distribution, the shape & structure of it holds invariantly, doesn't depend on arbitrary or flaky noise parameterization, actual unbiased non-limit approximation guaranteed by cross entropy loss structure this seems... way better than ODE
@torchcompiled i feel like sampling steps are mostly wasted on solving for selecting a particular sample of the learned distribution of the diffuser in the diffusion context, rather than iterating in a way that progressively makes a sample "better" as it were ie no clear step count relationship
Not exactly what I’m getting at, the question is more, if I pass the intermediate representation to the lm head, say output of loop 1, 2, 3… so on, is each prediction improving from the last? Or is it a more complex path that’s doing other kinds of computation as opposed to just directly outputting something that will score better CE at each pass.
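One way to make that question concrete is to read out the LM head after every loop and check whether cross-entropy on the target token goes down each time. The sketch below is a toy under loud assumptions: `loop_block`, the random weights `W` and `U`, and the sizes are all hypothetical stand-ins, not a trained looped transformer, so the numbers it prints say nothing about real models. It only shows the shape of the measurement.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def loop_block(W, h):
    # Hypothetical weight-tied block: residual update h <- h + tanh(W h).
    return [hi + math.tanh(u) for hi, u in zip(h, matvec(W, h))]

def per_loop_cross_entropy(W, U, h0, target, n_loops):
    """Apply the same block n_loops times; after each pass, read out the
    shared LM head U and record cross-entropy on the target token."""
    h, losses = list(h0), []
    for _ in range(n_loops):
        h = loop_block(W, h)
        probs = softmax(matvec(U, h))
        losses.append(-math.log(probs[target]))
    return losses

def is_monotone_refinement(losses, tol=1e-9):
    # True only if every loop scores at least as well as the previous one.
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))

random.seed(0)
d, vocab = 8, 5  # toy sizes (assumption)
W = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
U = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(vocab)]
h0 = [random.gauss(0, 1) for _ in range(d)]
losses = per_loop_cross_entropy(W, U, h0, target=2, n_loops=6)
print(losses, is_monotone_refinement(losses))
```

On a real model, a non-monotone curve here would point toward the "more complex path" reading, while a steadily decreasing one would look more like guess/refine.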
@torchcompiled - where more steps is ALWAYS better, because you're still fundamentally applying the same regression target transformation on the ae latents at varying noise levels, unless some ways of sampling reliably induce non-stochastic/structured usage of recurrent depth
@torchcompiled something that learns to exploit recurrent depth *end to end* can learn to produce circuits that depend on that property. there's no room for such circuits to emerge for typical diffusion training bc its almost never e2e multistep, & not even BPTT/truncated, just various 1steps
@torchcompiled i think if you train the looped transformer to independently refine some output target (unclear to me what that'd be analogously here) where there's detachment between the "refine loops", then what i am saying about the lack of coordination ALSO applies to the looped transformer
@torchcompiled and in the detached case it would apply even if you carried the final hidden state or something back to the beginning in a true-ish recurrent manner, because from the "gradient of the loop"'s perspective, that state sort of just... showed up as the input on any nth iteration
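The end-to-end vs detached distinction can be seen in a scalar toy. This is a minimal sketch, assuming a hypothetical weight-tied loop h_{k+1} = tanh(w * h_k) with squared-error loss on the final state; the function names and the loop itself are illustrative and come from neither a diffusion model nor a looped transformer. The end-to-end gradient applies the chain rule through every loop, while the detached gradient treats the state entering the last loop as a constant that "just showed up".

```python
import math

def unroll(w, h0, n_loops):
    # Hypothetical weight-tied loop: h_{k+1} = tanh(w * h_k).
    hs = [h0]
    for _ in range(n_loops):
        hs.append(math.tanh(w * hs[-1]))
    return hs

def loss(w, h0, y, n_loops):
    return (unroll(w, h0, n_loops)[-1] - y) ** 2

def grad_end_to_end(w, h0, y, n_loops):
    """dL/dw with the gradient flowing through every loop."""
    hs = unroll(w, h0, n_loops)
    dh = 0.0  # d h_0 / d w
    for k in range(n_loops):
        # d h_{k+1}/dw = (1 - h_{k+1}^2) * (h_k + w * d h_k/dw)
        dh = (1.0 - hs[k + 1] ** 2) * (hs[k] + w * dh)
    return 2.0 * (hs[-1] - y) * dh

def grad_detached(w, h0, y, n_loops):
    """dL/dw when the state entering the last loop is treated as a constant,
    so the dependence of earlier loops on w is dropped."""
    hs = unroll(w, h0, n_loops)
    dh = (1.0 - hs[-1] ** 2) * hs[-2]
    return 2.0 * (hs[-1] - y) * dh

w, h0, y = 0.9, 0.5, 1.0
g_e2e = grad_end_to_end(w, h0, y, n_loops=4)
g_det = grad_detached(w, h0, y, n_loops=4)
print(g_e2e, g_det)
```

With a single loop the two gradients coincide. With several loops only the end-to-end one carries signal about how earlier passes should set up later ones, which is the coordination the thread is pointing at.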
I think the "implicit" part is inaccurate. Implicit is true of GANs, not of diffusion. While we learn the score rather than optimizing p(x) directly, you can retrieve p(x) for a given datapoint through Monte Carlo estimates. So it’s more complicated, but yeah, the score is plenty for recovering p(x) up to a scaling/normalizing factor, and starting from a Gaussian and integrating solves that issue. Basically, if you know the slope everywhere you can recover the function itself.
Also, on not learning the true distribution: denoising score matching (and regular score matching, which is equivalent) minimizes the difference between the score of the data and the score of the model (aka Fisher divergence), which minimizes an upper bound on the KL between the data distribution and the model distribution. To the previous point, basically if the derivative is equivalent everywhere, so is p(x) itself.
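The "know the slope everywhere, recover the function" point can be checked numerically in one dimension. This is a minimal sketch, assuming a hypothetical two-component Gaussian mixture whose score is written out analytically (in a real diffusion model the score would be a learned network, and the data would not be 1D). Integrating the score from a reference point recovers log p up to an additive constant, i.e. p up to a normalizing factor.

```python
import math

# Hypothetical 1D mixture parameters (assumption, chosen for illustration).
WEIGHTS, MEANS, STDS = (0.3, 0.7), (-2.0, 1.5), (0.6, 1.0)

def density(x):
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(WEIGHTS, MEANS, STDS)
    )

def score(x):
    """d/dx log p(x) = p'(x) / p(x), computed analytically for the mixture."""
    dp = sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        * (-(x - m) / s ** 2)
        for w, m, s in zip(WEIGHTS, MEANS, STDS)
    )
    return dp / density(x)

def log_density_from_score(x, x_ref=0.0, n=4000):
    """Trapezoid-rule integral of the score from x_ref to x.
    Returns log p(x) - log p(x_ref): log p up to an additive constant."""
    h = (x - x_ref) / n
    total = 0.5 * (score(x_ref) + score(x))
    total += sum(score(x_ref + i * h) for i in range(1, n))
    return total * h

for x in (-3.0, -1.0, 0.5, 2.5):
    recovered = log_density_from_score(x)
    true_diff = math.log(density(x)) - math.log(density(0.0))
    print(x, recovered, true_diff)
```

The unknown constant is log p(x_ref); fixing it is the normalization step the reply mentions.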
@kalomaze This is fair, I can agree with this part, but now this is along the sequence axis I guess, instead of the depth axis? But yeah, more steps for diffusion just reduce ODE integration error, which isn’t super helpful after a bit.
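The integration-error point can be illustrated with a case where the ODE has a closed-form answer. This is a rough sketch under stated assumptions: toy data distributed as N(0, S0^2), a noise schedule sigma(t) = t, and the exact score in place of a learned one, so the only error left is the solver's. The probability-flow ODE is then dx/dt = -t * score(x, t), and plain Euler is used as the sampler.

```python
import math

S0 = 1.0   # std of the toy data distribution N(0, S0^2) (assumption)
T = 10.0   # largest noise level, with sigma(t) = t (assumption)

def score(x, t):
    # Exact score of the noised marginal p_t = N(0, S0^2 + t^2).
    return -x / (S0 ** 2 + t ** 2)

def euler_sample(x_T, n_steps):
    """Euler integration of dx/dt = -t * score(x, t) from t = T down to t = 0."""
    x, h = x_T, T / n_steps
    for i in range(n_steps):
        t = T - i * h
        x -= h * (-t * score(x, t))  # step backward in time by h
    return x

def exact_sample(x_T):
    # Closed-form solution of the same ODE at t = 0.
    return x_T * S0 / math.sqrt(S0 ** 2 + T ** 2)

x_T = 7.3  # an arbitrary starting point at noise level T
errors = {n: abs(euler_sample(x_T, n) - exact_sample(x_T)) for n in (10, 100, 1000)}
print(errors)
```

Each tenfold increase in steps moves the result closer to the same fixed endpoint, with shrinking absolute gains. Extra steps land more accurately on the sample the ODE already defines rather than producing a better sample.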