What is the key bottleneck to scaling looped transformers (LT)? A major challenge is speed: every loop iteration runs full quadratic attention. More loops make the model more powerful, but also much slower.
Introducing LT2: linear-time looped transformers that loop over linear attention and sparse attention. Linear and sparse attention make each loop iteration fast. The loop, in turn, gives linear attention iterative control over its recurrent memory and recursively enlarges the receptive field of sparse attention. Fast attention accelerates the loop and the loop enriches attention, making LT2 a Pareto-frontier architecture compared with standard looped transformers.
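To make the idea concrete, here is a minimal NumPy sketch of a weight-shared loop over linear attention and sliding-window sparse attention. It is an illustration under our own assumptions, not the LT2 implementation: the ReLU-style feature map, the sliding-window pattern, the residual wiring, and the names `linear_attention`, `sliding_window_attention`, and `looped_block` are all hypothetical. See the paper and code for the actual design.

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention with a running state (cost linear in length)."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # assumed positive feature map
    q, k = phi(q), phi(k)
    n, d = q.shape
    state = np.zeros((d, v.shape[1]))           # recurrent memory: sum of k_t v_t^T
    norm = np.zeros(d)                          # normalizer: sum of k_t
    out = np.empty_like(v)
    for t in range(n):
        state += np.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm)
    return out

def sliding_window_attention(q, k, v, window):
    """Causal softmax attention over only the last `window` tokens."""
    n, d = q.shape
    out = np.empty_like(v)
    for t in range(n):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[t] = (w / w.sum()) @ v[lo:t + 1]
    return out

def looped_block(x, params, n_loops, window):
    """Apply the same weights n_loops times (hypothetical wiring)."""
    wq, wk, wv = params
    for _ in range(n_loops):
        # Each pass re-reads and rewrites the linear-attention memory.
        x = x + linear_attention(x @ wq, x @ wk, x @ wv)
        # Each pass lets information travel up to window - 1 tokens further.
        x = x + sliding_window_attention(x @ wq, x @ wk, x @ wv, window)
    return x
```

In this sketch each loop iteration costs time linear in sequence length for a fixed window, and stacking iterations of the sliding-window step widens the span of earlier tokens that can influence a given position.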
This is a large paper. We ran careful pretraining ablations to find the best architecture, then used that architecture to distill a hybrid looped transformer, Ouro-hybrid-1.4B, which delivers both industry-level performance and fast inference. To read more:
Paper: https://arxiv.org/pdf/2605.20670
Code: https://github.com/chili-lab/LT2
Project: https://charlesdddd.github.io/lt2/
Model: https://huggingface.co/chili-lab/Ouro-hybrid-1.4B
