A great interview! I agree with the high-level principles for designing a continual learning benchmark, which we also adopt in our recent work, "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents" (led by @YihengShu @osunlp): tasks in sequence, cross-task relationships, and gain metrics. In AgentCL, we repurpose popular benchmarks in coding, deep research, and language understanding, propose an experimental setup that controls cross-task relationships in the task stream, and define gain metrics to measure plasticity, stability, and generalization.
Many methodologies can enable continual learning. In AgentCL, we focus on non-parametric memory designs, which are lightweight, inference-time agent adaptations, and conduct a comprehensive evaluation of them.
https://arxiv.org/abs/2606.02461
capability != learning
new benchtalks with @pgasawa on continual learning, where we discuss teaching models to learn from experience, measuring learning ability, the bet on parametric models, and more
01:06 What is continual learning?
04:10 Why capability and learning are different
06:13 Why build a benchmark?
08:07 Continual Learning Bench launch and reception
09:13 Anthropic's Fable release and Continual Learning Bench
11:02 How to design tasks for continual learning
18:41 The gain metric
24:01 What good looks like on the leaderboard
29:13 Failure modes: why models can't update their beliefs
31:12 Parametric systems and future architectures
34:30 Open science and AI safety
45:42 Lightning round
49:08 How to contribute to Continual Learning Bench

