Tilde introduces Wall Attention, allowing RoPE-free models trained on 4,000 tokens to generalize to 200,000-token contexts without retraining

Original post

Tilde@tilderesearch

Introducing Wall Attention. Diagonal forget gates enable RoPE-free attention with exceptional length generalization.

Wall outperforms the dominant method RoPE and sophisticated data-dependent methods like Forgetting Attention (FoX). We trained models with Wall on 4k sequence length and they generalized without further training to 200k+ tokens.

Wall generalizes diagonal forget gates from linear RNNs (KDA, RWKV 7, GLA) to softmax attention through a principled induced action framework. It enables transformers to selectively remember or forget per-channel within the attention head, dramatically boosting expressivity.
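To make "forget per-channel within the attention head" concrete, here is a minimal pure-Python sketch. It rests on an assumption the thread does not confirm: that the logit between query position i and key position j is the dot product of q_i and k_j with each channel c scaled by the product of that channel's gates g_t[c] over t = j+1..i, the same diagonal decay used in gated linear RNNs. The names `wall_logits` and `causal_softmax` are hypothetical, not taken from Tilde's kernels.

```python
import math

def wall_logits(q, k, g):
    """Causal logits with per-channel (diagonal) forget gates.

    q, k, g are lists of n rows with d floats each; g[t][c] is assumed to lie
    in (0, 1]. Assumed form (not confirmed by the source):
        logits[i][j] = sum_c q[i][c] * k[j][c] * prod_{t=j+1..i} g[t][c]
    Entries above the diagonal are -inf (causal mask).
    """
    n, d = len(q), len(q[0])
    logits = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        decay = [1.0] * d  # empty product when j == i
        for j in range(i, -1, -1):
            logits[i][j] = sum(q[i][c] * decay[c] * k[j][c] for c in range(d))
            # Extend the gate product to include position j before visiting j - 1.
            decay = [decay[c] * g[j][c] for c in range(d)]
    return logits

def causal_softmax(logits):
    """Row-wise softmax; masked (-inf) entries receive zero weight."""
    out = []
    for row in logits:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

# Three positions, two channels. Only channel 0 is forgotten at position 2.
q = [[1.0, 1.0]] * 3
k = [[1.0, 1.0]] * 3
g = [[1.0, 1.0], [1.0, 1.0], [0.5, 1.0]]
logits = wall_logits(q, k, g)
# logits[2] == [1.5, 1.5, 2.0]: channel 0 of older keys is halved, channel 1 is untouched.
weights = causal_softmax(logits)
```

With every gate equal to 1 this reduces to ordinary causal dot-product attention. A scalar forget gate, as in FoX, would instead apply the same decay to all channels of a head at once.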

Wall is production-ready. Wall retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel achieves SOTA-level decode throughput.

Continual learning over long-context is fundamentally about selective forgetting → and Wall attention is all about selective forgetting.

Tilde@tilderesearch

http://x.com/i/article/2061750934179643392

8:57 AM · Jun 2, 2026 · 19.5K Views

Views 17.8K · Bookmarks 149 · Likes 178 · Retweets 18 · Replies 3

rohan anil@_arohan_

2-simplicial attention 🤝 wall attention, would be cool to prove this out formally.

rohan anil@_arohan_

Ah, it only has the 3-way product on the scores and not on the values. So it's not the full 2-simplicial form: https://arxiv.org/pdf/2507.02754

Tilde@tilderesearch

Read the full post here: https://blog.tilderesearch.com/blog/wall-attn

rohan anil@_arohan_

It's diagonal on the scores as well, so it's a strict subset.

rohan anil@_arohan_

v2 = vec(1), k1 = k1 times exp(-(sum log(g))), k2 = exp(log(g)); add a mask on q for modulating the 3rd axis.

The k1k2 produces the wall logits F_ij

Fin.
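One reading of the construction above, sketched in pure Python: if the gate product is written as exp(c_i - c_j), where c_t is the running sum of log-gates per channel, then scaling keys by exp(-c_j) and carrying exp(c_i) on the other side reproduces the per-channel decayed logits as a plain dot product. Folding the positive factor into the query is a simplification made here, not something the reply specifies, and the decayed-logit form itself is an assumption about Wall. All function names are hypothetical.

```python
import math

def direct_logits(q, k, g):
    """Reference form: sum_c q[i][c] * k[j][c] * prod_{t=j+1..i} g[t][c], for j <= i."""
    n, d = len(q), len(q[0])
    out = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            total = 0.0
            for c in range(d):
                decay = 1.0
                for t in range(j + 1, i + 1):
                    decay *= g[t][c]
                total += q[i][c] * k[j][c] * decay
            out[i][j] = total
    return out

def factorized_logits(q, k, g):
    """Same logits from rescaled queries and keys via cumulative log-gates."""
    n, d = len(q), len(q[0])
    # cum[t][c] = sum_{s <= t} log g[s][c]
    cum, running = [], [0.0] * d
    for t in range(n):
        running = [running[c] + math.log(g[t][c]) for c in range(d)]
        cum.append(running)
    q_scaled = [[q[i][c] * math.exp(cum[i][c]) for c in range(d)] for i in range(n)]
    k_scaled = [[k[j][c] * math.exp(-cum[j][c]) for c in range(d)] for j in range(n)]
    out = [[-math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            out[i][j] = sum(q_scaled[i][c] * k_scaled[j][c] for c in range(d))
    return out

q = [[0.3, -1.2], [0.7, 0.1], [-0.4, 0.9], [1.1, 0.5]]
k = [[0.5, 0.2], [-0.6, 1.0], [0.8, -0.3], [0.1, 0.4]]
g = [[0.9, 0.5], [0.8, 0.99], [0.7, 0.6], [0.95, 0.4]]
a = direct_logits(q, k, g)
b = factorized_logits(q, k, g)
max_gap = max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(i + 1))
```

This naive factorization is only meant to show the algebra: exp(-c_j) grows as gates accumulate, so it can overflow on long sequences unless the computation is chunked or kept in log space.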

rohan anil@_arohan_

https://blog.tilderesearch.com/blog/wall-attn

Tilde@tilderesearch

Wall builds off a strong body of prior work on data-dependent positional embeddings. We want to explicitly cite a few ideas we drew inspiration from, in particular PaTH by @SonglinYang4 and KDA by @yzhang_cs.

Michael Hla@hla_michael

@tilderesearch This is beautiful

Yu Zhang 🐙🌘@yzhang_cs

@tilderesearch @SonglinYang4 very glad if you could contribute wall attn to FLA

Guilherme O'Tina@guilhermeotina

the score vs values split makes sense practically. trilinear on scores is O(n²) per head, same as standard attention. trilinear on values would be O(n² d) at least. the score gate is where the length extrapolation leverage lives anyway, because it controls which token pairs interact. value mixing is just averaging, adding a third tensor there is expensive without a clear mechanism
