This paper proposes a new test to see whether AI agents truly get better as they gain experience and finds they mostly still confuse memory with learning.
It shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context earning the best overall score.
That distinction matters because the next wave of AI is not supposed to answer isolated prompts.
It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday’s mistake should make tomorrow’s action sharper.
The authors build CL-BENCH, a benchmark where an agent works through connected tasks across 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies.
Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent’s strategy, so better performance should come from experience rather than pretraining.
They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups.
The key finding is that current memory-heavy AI agents are not reliably better learners than just keeping the full conversation in context.
That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes.
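To make the setup concrete, here is a toy sketch of a connected-task loop with two memory styles. It is not CL-BENCH code: `FullContextMemory`, `ScratchpadMemory`, `run_sequence`, and the tiny "which table holds this data" environment are all hypothetical names invented for illustration, and the agent is a plain lookup rather than an LLM. The bounded scratchpad loses by construction here, so treat it as a picture of the idea, not as evidence for the paper's result.

```python
from dataclasses import dataclass, field


@dataclass
class FullContextMemory:
    """Hypothetical stand-in for plain context: keeps every note verbatim."""
    log: list = field(default_factory=list)

    def write(self, note: str) -> None:
        self.log.append(note)

    def read(self) -> list:
        return list(self.log)


@dataclass
class ScratchpadMemory:
    """Hypothetical bounded memory: keeps only the latest `capacity` notes."""
    capacity: int = 1
    notes: list = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)
        self.notes = self.notes[-self.capacity:]

    def read(self) -> list:
        return list(self.notes)


def run_sequence(memory, tasks, hidden) -> float:
    """Run connected tasks in order and return the fraction answered correctly.

    Each task asks for the hidden value behind a key. The agent can only
    answer from what its memory still holds; after every task the
    environment reveals the right answer, which is written back as feedback.
    """
    correct = 0
    for key in tasks:
        # Rebuild what the agent currently "knows" from its stored notes.
        known = dict(note.split("=", 1) for note in memory.read())
        if known.get(key) == hidden[key]:
            correct += 1
        # Feedback from the environment becomes a new memory entry.
        memory.write(f"{key}={hidden[key]}")
    return correct / len(tasks)


# Toy hidden pattern: which (made-up) table stores which kind of record.
HIDDEN = {"users": "tbl_a", "orders": "tbl_b", "items": "tbl_c"}
TASKS = ["users", "orders", "users", "items", "orders", "users"]

full_score = run_sequence(FullContextMemory(), TASKS, HIDDEN)
pad_score = run_sequence(ScratchpadMemory(capacity=1), TASKS, HIDDEN)
print(f"full context: {full_score:.2f}, scratchpad: {pad_score:.2f}")
```

The point of the sketch is the shape of the evaluation: the score can only rise if something learned on an earlier task survives to a later one, which is the property a benchmark of connected tasks is built to measure.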
----
Link – arxiv.org/abs/2606.05661
Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"







