Now for the Information Bottleneck framing. I can't believe I'm the one pushing back on this part 😫, and even though I personally agree with the intuition (the residual stream navigating an IB tradeoff in depth is a beautiful way to think about it) - it's hand-wavy.
And the "at scale" part is the underrated contribution. All L layers still run, with no truncation — which means KV cache / continuous batching / tensor parallelism stay untouched. They only re-route which layer feeds the sampler.

