We introduce TAPA, which decouples magnitude from angular contributions in positional encoding. TAPA yields better OOD (long-context) performance than the vanilla RoPE approach. We also provide a theoretical analysis of why it works.
Thanks to @yusidwang and colleagues for the great work!
We’d like to introduce our paper on long-context positional encoding, centered on a simple principle: