HOLA Adds Exact KV Cache To Linear Attention For Robust Long-Range Recall

Original post

elvis@omarsar0

NEW paper worth reading.

(bookmark it)

The basic idea is to pair a compressive recurrent state with a small exact memory, which helps to recover long-range recall without giving up the efficiency of linear attention.

More on it below:

Linear-attention and state-space models compress the whole prefix into a fixed-size state. That buys O(1) memory, but when many key-value associations compete, earlier facts get overwritten and needle recall degrades.
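
To make the overwriting concrete, here is a minimal pure-Python sketch of a delta-rule associative memory. It is an illustration, not code from the paper: the function names, the tiny 1x2 state, and the specific keys and values are all assumptions chosen so the arithmetic is easy to follow.

```python
def read(state, key):
    """Return state @ key for a d_v x d_k state stored as nested lists."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def delta_write(state, key, value, beta=1.0):
    """Delta-rule update in place: S <- S + beta * (v - S k) k^T."""
    pred = read(state, key)
    residual = [v - p for v, p in zip(value, pred)]
    for row, r in zip(state, residual):
        for j, k in enumerate(key):
            row[j] += beta * r * k
    return residual


# Toy setup (assumed): 1-dimensional values, 2-dimensional unit-norm keys.
state = [[0.0, 0.0]]
k1, k2, k3 = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]

delta_write(state, k1, [1.0])
delta_write(state, k2, [2.0])
recall_before = read(state, k1)[0]  # orthogonal keys: the first fact is still exact

# A third key that overlaps k1 and k2 must share the same fixed-size state.
delta_write(state, k3, [5.0])
recall_after = read(state, k1)[0]   # the first fact has drifted away from 1.0
recall_new = read(state, k3)[0]     # the newest fact is the one stored exactly

print(recall_before, recall_after, recall_new)
```

The state never grows past 1x2 here, which is the O(1) memory benefit, but once a third association arrives the earliest one is no longer recoverable from the state alone.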

HOLA gives linear attention a hippocampal complement. It keeps the usual delta-rule state as compressive memory and adds a bounded exact KV cache, forming a semiparametric test-time memory.

The state models linearly compressible structure while the cache stores associations that should not be forced through it. The cache writes without a learned eviction module, keeping only tokens whose prediction residual was actually committed to the state.
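
The pairing of a compressive state with a bounded exact cache can be sketched roughly as follows. This is a toy under stated assumptions, not HOLA's actual algorithm: the `HybridMemory` class, the "keep the largest residuals" cache policy, and the thresholded exact-match read are all hypothetical stand-ins for whatever write and read rules the paper defines.

```python
import heapq
import math


class HybridMemory:
    """Toy semiparametric memory: delta-rule state plus a bounded exact cache."""

    def __init__(self, d_k, d_v, cache_size):
        self.state = [[0.0] * d_k for _ in range(d_v)]
        self.cache_size = cache_size
        self.cache = []  # min-heap of (residual_norm, step, key, value)
        self.step = 0

    def read_state(self, key):
        return [sum(s * k for s, k in zip(row, key)) for row in self.state]

    def write(self, key, value, beta=1.0):
        # Commit the prediction residual to the compressive state (delta rule).
        pred = self.read_state(key)
        residual = [v - p for v, p in zip(value, pred)]
        for row, r in zip(self.state, residual):
            for j, k in enumerate(key):
                row[j] += beta * r * k
        # Assumed fixed policy (no learned eviction): keep the exact pairs
        # whose residual norm was largest, dropping the smallest when full.
        score = math.sqrt(sum(r * r for r in residual))
        entry = (score, self.step, tuple(key), tuple(value))
        self.step += 1
        if len(self.cache) < self.cache_size:
            heapq.heappush(self.cache, entry)
        else:
            heapq.heappushpop(self.cache, entry)

    def read(self, key, match_threshold=0.999):
        # Assumed read rule: use the cache on a near-exact key match,
        # otherwise fall back to the compressive state.
        def sim(entry):
            return sum(a * b for a, b in zip(entry[2], key))

        if self.cache:
            best = max(self.cache, key=sim)
            if sim(best) >= match_threshold:
                return list(best[3])
        return self.read_state(key)


mem = HybridMemory(d_k=2, d_v=1, cache_size=2)
k1, k2, k3 = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
mem.write(k1, [4.0])  # large residual: kept in the cache
mem.write(k2, [0.5])  # small residual: evicted once the cache fills
mem.write(k3, [5.0])  # overlaps k1 and k2, so it perturbs the state

needle_from_state = mem.read_state(k1)[0]  # distorted by the third write
needle_from_hybrid = mem.read(k1)[0]       # exact, served from the cache
evicted_fact = mem.read(k2)[0]             # no cache hit, so state only
print(needle_from_state, needle_from_hybrid, evicted_fact)
```

In this toy run the cached association survives interference exactly, while the evicted one only gets the distorted state readout, which is the trade-off a bounded cache implies.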

At 340M parameters on 15B SlimPajama tokens, HOLA lowers Wikitext perplexity from 27.32 to 22.92, below a full-attention Transformer++ at 26.88, and stays robust on RULER needle recall out to 32k tokens, 16x its training length.

Paper: https://arxiv.org/abs/2607.02303

Learn to build effective AI agents in our academy: https://academy.dair.ai/

8:38 AM · Jul 3, 2026 · 13.1K Views

Vanar@Vanarchain

@omarsar0 Good AI memory isn't just about storing more. It's about knowing what needs to stay exact and what can be compressed.


Mohammed Benoughidene@mohamedbeno22

@omarsar0 smart way to fix the needle-in-a-haystack problem without falling back to full attention. would love to see this tested on actual codebases instead of just benchmarks.



Hussain Hashim | Building SundayBack@itsthedonhashim

@omarsar0 this sounds like it could change how we think about handling long-range dependencies. linear attention always seemed like it had that trade-off, so this is super interesting!

Jeffrey Li 💙💛@askerlee

@omarsar0 The largest trained model is only 340M, so any conclusions drawn from these experiments are not very reliable


V0LYX@0xV0LYX

@omarsar0 Small exact memory is clever. Keeps the throughput without losing context at that length.

Questions

So it is LSTM-esque in how the comp state and exact mem interact? Or more novel on the alignment side?


Melio@MelioHL

@omarsar0 Bookmarked it. Linear attention finally getting serious about memory compression without blowing up compute feels like the missing piece.


NTK AI@NtokozoAI

Great read. Let me try to simplify it…

Think about how you take notes in a long meeting.

You keep a running summary of the discussion, but names, numbers and dates get written down exactly because a summary can distort them.

That is the idea behind Hippocampal Linear Attention (HOLA). Efficient models keep the cheap summary, then add a small exact notepad for details the summary failed to hold.

The bigger shift is not more memory. It is memory that knows what to compress and what to protect.


Michael Robinson@m8rbnsn

@omarsar0 This paper doesn't even cite, much less ablate against, "Just read twice: closing the recall gap for recurrent language models" (Simran Arora et al, 2024)


Eclipse 🌖@ECLresearch

@omarsar0 Compressive recurrence paired with a small exact memory is a clean trade-off — curious if the recall recovery holds up under long-context perplexity benchmarks vs. Mamba-2 or GLA.


MrGreenie@YourGreenie989

@omarsar0 Reviewed AtomicMemory and appreciate the transparency design. Open source, self-hosted. You can read, audit, and correct your agent's beliefs directly. Principled memory governance
