
Tilde introduces Wall Attention, allowing RoPE-free models trained on 4,000 tokens to generalize to 200,000-token contexts without retraining

The mechanism adapts diagonal forget gates from linear RNNs.

Original post · Aryaman Arora · #678
Tilde (@tilderesearch)

Introducing Wall Attention. Diagonal forget gates enable RoPE-free attention with exceptional length generalization.

Wall outperforms the dominant method RoPE and sophisticated data-dependent methods like Forgetting Attention (FoX). We trained models with Wall on 4k sequence length and they generalized without further training to 200k+ tokens.

Wall generalizes diagonal forget gates from linear RNNs (KDA, RWKV 7, GLA) to softmax attention through a principled induced action framework. It enables transformers to selectively remember or forget per-channel within the attention head, dramatically boosting expressivity.
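The post does not spell out the formula, so the following is only a rough illustration of what "per-channel forgetting inside a softmax attention head" could look like, not Tilde's kernel or its induced action framework. It assumes (hypothetically) that each channel of the query-key product is scaled by a cumulative decay built from per-token, per-channel gates in (0, 1], in the spirit of the diagonal gates used by GLA-style linear RNNs. The function name `gated_softmax_attention` and the `log_g` parameter are made up for this sketch.

```python
import numpy as np

def gated_softmax_attention(q, k, v, log_g):
    """Causal softmax attention with a per-channel decay on the q.k product.

    Illustrative assumption, not the published Wall formulation.
    q, k: (T, d); v: (T, dv); log_g: (T, d) with entries <= 0
    (logs of gates in (0, 1]).
    Score for query i and key j <= i:
        s_ij = sum_c q[i, c] * k[j, c] * exp(sum_{l=j+1..i} log_g[l, c]) / sqrt(d)
    """
    T, d = q.shape
    cum = np.cumsum(log_g, axis=0)              # cum[i] = sum_{l <= i} log_g[l]
    diff = cum[:, None, :] - cum[None, :, :]    # (T, T, d); <= 0 wherever j <= i
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Clamp at 0 so masked (j > i) entries cannot overflow before being zeroed.
    decay = np.where(causal[:, :, None], np.exp(np.minimum(diff, 0.0)), 0.0)
    scores = np.einsum("ic,jc,ijc->ij", q, k, decay) / np.sqrt(d)
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

With every gate equal to 1 (`log_g` all zeros) this reduces to ordinary causal softmax attention with no positional encoding; gates below 1 shrink the contribution of older keys channel by channel. The sketch materializes a (T, T, d) tensor for clarity, so it is a readability aid only and says nothing about how the open-sourced Triton kernels are organized.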

Wall is production-ready. Wall retains the parallel structure of vanilla attention, is compatible with GQA & MLA, and we open-source reference Triton kernels for training and decoding. Our WallDecode kernel achieves SOTA-level decode throughput.

Continual learning over long-context is fundamentally about selective forgetting → and Wall attention is all about selective forgetting.

8:57 AM · Jun 2, 2026 · 5.9K Views
Posts from X
Views 4.2K · Bookmarks 34 · Likes 57 · Retweets 6 · Replies 3
rohan anil (@_arohan_)

2-simplicial attention 🤝 wall attention, would be cool to prove this out formally.

1h · Views 4.2K · Likes 57 · Bookmarks 34