8h ago

Paper titled “RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably” proves that Rotary Position Embeddings (RoPE) can assign identical attention weights to distinct tokens and to distinct positions in long sequences

Position inversion and aliasing grow with context length from 10^4 to 10^8 base values.


Original post

#279@SOLDNIOP

Hao Peng@HAOPENG_UIUC

Excited to share our new paper: RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

LLMs often fail on inputs well within their advertised context lengths. We show that these failures are not merely engineering issues, but stem from intrinsic limitations of RoPE in long contexts.

Main finding: In long contexts, RoPE-based attention frequently assigns the same attention weight to a token even when it is moved to different positions. Similarly, it can assign the same attention weight to different tokens at the same position. In this sense, RoPE attention fails to distinguish both where a token appears and what token appears there, hence the title.

We prove these results theoretically and verify them empirically. While the theoretical analysis focuses on a single attention head, we complement it with experiments on real multi-layer, multi-head LLMs. The experiments confirm failures predicted by our theory: LLMs optimized for needle-in-a-haystack-style retrieval will inevitably struggle on a very simple task that asks for the k-th item in a list.

My personal takeaway: advertised context lengths should be interpreted with care. Future long-context LMs may require rethinking how position and token order are represented. With current architectures, agentic frameworks that break long contexts into shorter ones may be a more effective way to work around the intrinsic limitations of RoPE.

Paper: https://arxiv.org/abs/2605.15514

Huge congrats to my student Yufeng Du and others!
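For readers who want a feel for the "same attention weight at different positions" claim, here is a toy NumPy sketch. It is not the paper's construction or proof. It assumes the commonly used RoPE frequency formula theta_j = base^(-2j/d) with base 10000 and head dimension 64, a single head, and random query and key vectors; the helper names are hypothetical. It only illustrates a weaker, pigeonhole-style fact: RoPE logits for a fixed query and key are bounded, so when the same key is placed at many positions, two distinct positions must receive nearly the same logit.

```python
import numpy as np

def rope_angles(head_dim, base=10000.0):
    # Assumed standard RoPE frequencies: theta_j = base^(-2j/d), j = 0..d/2-1.
    j = np.arange(head_dim // 2)
    return base ** (-2.0 * j / head_dim)

def rope_rotate(x, pos, theta):
    # Treat each pair (x[2j], x[2j+1]) as a complex number and rotate it by pos * theta_j.
    z = x[0::2] + 1j * x[1::2]
    return z * np.exp(1j * pos * theta)

def rope_score(q, k, q_pos, k_pos, theta):
    # Real dot product between the rotated query and the rotated key.
    zq = rope_rotate(q, q_pos, theta)
    zk = rope_rotate(k, k_pos, theta)
    return float(np.real(np.sum(zq * np.conj(zk))))

def scores_over_key_positions(q, k, q_pos, key_positions, theta):
    # Vectorized version of rope_score for one key placed at many positions.
    zq = rope_rotate(q, q_pos, theta)
    zk0 = k[0::2] + 1j * k[1::2]
    zk = zk0[None, :] * np.exp(1j * key_positions[:, None] * theta[None, :])
    return np.real(zk.conj() @ zq)

rng = np.random.default_rng(0)
head_dim = 64                      # assumed head dimension
theta = rope_angles(head_dim)
q = rng.standard_normal(head_dim)  # toy query
k = rng.standard_normal(head_dim)  # toy key (the same token moved around)

num_positions = 20_000
query_pos = float(num_positions)
key_positions = np.arange(num_positions, dtype=np.float64)
scores = scores_over_key_positions(q, k, query_pos, key_positions, theta)

# Find the two distinct key positions whose logits are closest to each other.
order = np.argsort(scores)
gaps = np.diff(scores[order])
i = int(np.argmin(gaps))
pos_a, pos_b = int(key_positions[order[i]]), int(key_positions[order[i + 1]])
min_gap = float(gaps[i])

# Rotations preserve norms, so every logit lies in [-|q||k|, |q||k|].
pigeonhole_bound = 2.0 * np.linalg.norm(q) * np.linalg.norm(k) / (num_positions - 1)

print(f"closest pair of positions: {pos_a} and {pos_b}, logit gap {min_gap:.3e}")
print(f"pigeonhole upper bound on the smallest gap: {pigeonhole_bound:.3e}")
```

The bound shrinks as the number of candidate positions grows, which is one intuitive reason longer contexts leave less room for a single head to separate positions. The paper's actual statements about when weights coincide are stronger than this sketch.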

9:55 AM · May 19, 2026

Reposted by

#40@JEREMYPHOWARD

