Sangyun Lee and Giulia Fanti propose a "sleep" phase that converts an LLM's recent context into fast weights and then clears the KV cache.
Offline consolidation passes reduce quadratic attention costs during inference.
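To put rough numbers on the attention cost, here is a back-of-the-envelope sketch in Python. It only counts the query-key pairs that causal attention scores. The function names and the `wake_window` parameter (tokens processed between cache clears) are made up for illustration and do not come from the paper, and the count ignores whatever the consolidation passes themselves cost.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by causal attention over num_tokens tokens."""
    # Token i attends to tokens 1..i, so the total is 1 + 2 + ... + num_tokens.
    return num_tokens * (num_tokens + 1) // 2


def pairs_with_cache_clearing(total_tokens: int, wake_window: int) -> int:
    """Same count when the KV cache is emptied every wake_window tokens."""
    full_segments, remainder = divmod(total_tokens, wake_window)
    return full_segments * attention_pairs(wake_window) + attention_pairs(remainder)


# Hypothetical sizes: 8192 tokens in total, cache cleared every 1024 tokens.
full = attention_pairs(8192)
segmented = pairs_with_cache_clearing(8192, 1024)
print(full, segmented, full / segmented)
```

With a fixed window, the count grows linearly in the total number of tokens instead of quadratically; whether that is a net saving depends on the cost of the sleep passes, which this sketch leaves out.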
abs: https://arxiv.org/abs/2605.26099
Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."
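As a toy illustration of the idea (not the authors' method; the abstract does not say how the fast weights are updated), the Python sketch below swaps in a simple stand-in: a linear associative memory trained by a few gradient steps on the cached key-value pairs, after which the cache is emptied. The names `sleep_consolidate`, `n_steps`, and `lr` are hypothetical, and `n_steps` only loosely mirrors the "sleep duration N" from the abstract.

```python
import numpy as np


def sleep_consolidate(fast_w, keys, values, n_steps, lr=0.5):
    """Write cached (key, value) pairs into fast_w with n_steps of gradient descent.

    Stand-in objective (an assumption, not from the paper):
    mean squared error between keys @ fast_w and values.
    """
    for _ in range(n_steps):
        residual = keys @ fast_w - values
        grad = keys.T @ residual / len(keys)
        fast_w = fast_w - lr * grad
    return fast_w


rng = np.random.default_rng(0)
d = 8

# "Wake" phase: a small cache of key/value vectors collected from context.
cache_k = [k / np.linalg.norm(k) for k in rng.standard_normal((4, d))]
cache_v = list(rng.standard_normal((4, d)))
keys, values = np.array(cache_k), np.array(cache_v)


def recall_error(n_steps):
    """Distance between values read back from fast weights and the originals."""
    w = sleep_consolidate(np.zeros((d, d)), keys, values, n_steps)
    return float(np.linalg.norm(keys @ w - values))


# Longer sleep (more consolidation steps) leaves a smaller recall error.
errors = [recall_error(n) for n in (1, 4, 16)]
print(errors)

# "Sleep" phase: consolidate, then clear the cache; recall now uses fast_w only.
fast_w = sleep_consolidate(np.zeros((d, d)), keys, values, n_steps=16)
cache_k.clear()
cache_v.clear()
```

In this toy setting the recall error shrinks as `n_steps` grows, which only echoes the direction of the quoted result; it says nothing about the size of the gains reported in the paper.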