
Continual Learning Bench Shows Naive In-Context Learning Beats Memory Systems

Original post
elvis @omarsar0 · #483 in AI

// Continual Learning Bench //

One of the research areas with lots of investment is continual learning.

While there are many efforts, there is very little progress in measuring it.

So the big question is, do dedicated memory systems actually make agents learn from experience?

Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management.

CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances.

If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning.

Paper: https://arxiv.org/abs/2606.05661

Learn to build effective AI agents in our academy: https://academy.dair.ai/

8:20 AM · Jun 6, 2026 · 14.9K Views
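The post does not say how the gain metric is defined, so the following is only a minimal sketch of the general idea of separating learning from prior capability. It assumes gain is the mean per-instance score difference between an agent that has seen earlier instances and the same agent run cold. The function name `learning_gain` and all scores are hypothetical, made up for illustration, and are not taken from the paper.

```python
from statistics import mean


def learning_gain(with_experience, without_experience):
    """Mean per-instance score difference (hypothetical definition).

    with_experience: scores when the agent may reuse earlier instances.
    without_experience: scores for the same instances with no prior exposure,
    which stands in for the agent's prior capability.
    """
    if len(with_experience) != len(without_experience):
        raise ValueError("score lists must cover the same instances")
    return mean(w - c for w, c in zip(with_experience, without_experience))


# Made-up scores in [0, 1] for four task instances; NOT results from CL-Bench.
cold = [0.50, 0.50, 0.50, 0.50]      # no experience: prior capability only
icl = [0.50, 0.75, 0.75, 1.00]       # plain in-context learning baseline
memory = [0.50, 0.50, 0.75, 0.75]    # some dedicated memory system

icl_gain = learning_gain(icl, cold)
memory_gain = learning_gain(memory, cold)
print(f"ICL gain: {icl_gain}, memory gain: {memory_gain}")
```

Under this assumed definition, a system that only scores well because the base model was already capable shows a gain near zero, which is the distinction the post draws between learning and prior capability.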
Sentiment

Users are interested in the continual learning benchmark for separating recall from actual behavior change in agents, though one reply argues memory systems mainly improve retrieval rather than enable learning.

Pos 83.3% · Neg 16.7%
6 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 122
Alien Operator @alienoperatortv

@omarsar0 Memory systems probably need the same discipline as databases: stable interfaces, narrow write paths, and tests that punish accidental recall. Otherwise “learning” becomes expensive context plumbing.

14h · Views 122
LIKES 2
E shine @simayi210

@omarsar0 The measurement gap is the real story here. Most "memory" demos test recall, not learning - does the agent's behavior actually change next time? My crude metric: same task 2 weeks apart, does it repeat the same mistake? Most still do. How are you scoring it?

11h · Views 82 · Likes 2
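The "crude metric" in the reply above can be sketched in a few lines. This assumes each run of the same task produces a set of labeled mistakes. The labels, the function name `repeat_mistake_rate`, and the data are hypothetical illustrations, not anything the reply or the paper specifies.

```python
def repeat_mistake_rate(first_run_errors, second_run_errors):
    """Fraction of first-run mistakes that show up again in the second run.

    Both arguments are iterables of hashable mistake labels (an assumption;
    how mistakes get labeled is left open in the reply).
    """
    first = set(first_run_errors)
    if not first:
        return 0.0  # nothing to repeat
    return len(first & set(second_run_errors)) / len(first)


# Made-up mistake labels for the same task run two weeks apart.
first = {"off_by_one", "wrong_units", "missed_edge_case", "stale_cache"}
second = {"wrong_units", "stale_cache", "new_typo"}

rate = repeat_mistake_rate(first, second)
print(f"repeat-mistake rate: {rate}")
```

A rate near 1.0 would suggest the agent recalls without changing behavior. New mistakes in the second run are deliberately ignored here, since the reply only asks whether old ones recur.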
RETWEETS 24
elvis @omarsar0

15h · Views 14.9K · Likes 223 · Bookmarks 179
Lunari @0x_lun

@omarsar0 "agents frequently overfit to immediate observations or fail to reuse knowledge across instances" is doing a lot of heavy lifting in that abstract lol

so yes memory helps but apparently not in the way anyone built it

15h · Views 107
DrewOnAI @Drew_OnAI

@omarsar0 finally someone measuring the actual learning instead of just hype

15h · Views 95
Artem Apfelbaum @iamapfelbaum

@omarsar0 memory != learning. most agents just get better at retrieval

14h · Views 74
Strata @ChainZenit

@omarsar0 this is such an interesting gap to look into.

15h · Views 52
Rugbist @rugbist_

@omarsar0 measuring it is the hard part tbh. memory systems are cool on paper but proving they actually learn instead of just caching is another thing

15h · Views 39
Blissy @BlissyOnX

@omarsar0 main question is whether memory systems actually help or just add noise

benchmark will tell either way but i'd guess most fail the coherence test

15h · Views 38
MT Ramos @mtramos

@omarsar0 This tracks: a memory system, as sophisticated as it can be, ≠ state. Thanks for sharing!

6h · Views 36
Pranab Sarkar @developerpranab

@omarsar0 It does, I am doing a similar experiment. The problem is the cost: with a local LLM the growth is limited, while with higher models it's significant.

I am using https://yantrikdb.com. I have added some functionality to enable the self-learning path in particular.

11h · Views 36
Eclipse 🌖 @ECLresearch

@omarsar0 Good question: without a standardized continual learning benchmark, it's hard to tell if dedicated memory is actually generalizing or just overfitting to replay buffers.

13h · Views 34
Invincible @InvincibleEdge

@omarsar0 benchmarks only matter if agents actually improve over time from the data they've seen

otherwise it's just static retrieval with extra steps

15h · Views 30
Lumin @luminxbt

@omarsar0 dedicated memory is necessary but measuring memory is a separate problem

one without the other stalls both

14h · Views 27
Draven @notdrvx

@omarsar0 every memory system says yes on their own test

someone else needs to run the eval

12h · Views 23
Amira @Bluelakeside823

@omarsar0 The useful part of this framing is that it separates recall from behavior change. A memory system should probably be judged by whether it changes the next decision under similar conditions, not just whether it can retrieve the last observation.

5h · Views 6
maguyva @maguyvaai

@omarsar0 the eval problem might matter more than the architecture - what even counts as 'learned' when the context resets between runs?

5h · Views 5