LLM memory updates cut GPT-5.4 accuracy to 54 percent
Researchers at the University of Illinois Urbana-Champaign and Tsinghua University found that LLM agents lose reliability when they compress past experiences into rewritten memory summaries. The study showed GPT-5.4 accuracy on ARC-AGI tasks dropping from 100 percent without memory to 54 percent after consolidation of ground-truth solutions. Raw episodic memories remained more dependable than the condensed lessons LLMs generated from them.
🚨 Breaking new study: memory in LLM agents still can’t really be trusted, even after over a trillion dollars has gone into the development of the field.
the whole point of databases is to keep reliable records of values that change over time.
in general, they can be trusted to be stable over time.
the whole point of LLMs is to raise money. in general they cannot be trusted.
the former (databases) will endure forever;
the latter (LLMs) will eventually be displaced by something more stable and efficient.
A new study from the University of Illinois Urbana-Champaign, Tsinghua University, and other labs finds that LLM agents still have unreliable memory, and that it can get worse when they keep rewriting their own memories.
LLM agents can learn from experience, but their rewritten memories often become unreliable.
The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons.
That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory.
The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them.
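To make that concrete, here is a minimal sketch (Python, not the paper's code) of the consolidation pattern being critiqued: every new episode triggers an LLM rewrite of the entire lesson text. The `Episode` fields and the `call_llm` stand-in are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of the "rewrite everything into lessons" memory pattern.
# `call_llm` is a hypothetical stand-in for whatever chat API the agent uses.
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str       # what the agent was asked to do
    actions: str    # the raw trajectory it took
    outcome: str    # e.g. "success" or "failure"

@dataclass
class ConsolidatedMemory:
    lessons: str = ""                              # the rewritten summary the agent consults
    raw: list[Episode] = field(default_factory=list)

    def add(self, ep: Episode, call_llm) -> None:
        self.raw.append(ep)
        # Each new episode rewrites the *whole* lesson text. Repeated rewrites
        # are where the paper reports drift: details get dropped, task types
        # get merged, and rules overfit to narrow examples.
        prompt = (
            "Rewrite these lessons so they also cover the new episode.\n"
            f"Current lessons:\n{self.lessons}\n"
            f"New episode: task={ep.task}, actions={ep.actions}, outcome={ep.outcome}\n"
            "Return only the updated lessons."
        )
        self.lessons = call_llm(prompt)
```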
The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions.
The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%.
The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples.
The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better.
The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away.
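A hedged sketch of that direction: keep raw episodes as append-only evidence, retrieve a few relevant ones verbatim at solve time, and treat any summary as an optional index rather than a replacement. The class and method names here (`EpisodicMemory`, `retrieve_similar`) are illustrative, not from the paper.

```python
# Sketch of episodic memory that never rewrites past experience.
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    actions: str
    outcome: str

@dataclass
class EpisodicMemory:
    episodes: list[Episode] = field(default_factory=list)

    def add(self, ep: Episode) -> None:
        # Append-only: past attempts are never rewritten or deleted.
        self.episodes.append(ep)

    def retrieve_similar(self, task: str, k: int = 3) -> list[Episode]:
        # Naive keyword overlap stands in for whatever retriever you prefer.
        words = set(task.lower().split())
        def score(ep: Episode) -> int:
            return len(words & set(ep.task.lower().split()))
        return sorted(self.episodes, key=score, reverse=True)[:k]

    def build_context(self, task: str) -> str:
        # The agent sees the raw evidence itself, not a lossy rewrite of it.
        return "\n\n".join(
            f"Past task: {ep.task}\nActions: {ep.actions}\nOutcome: {ep.outcome}"
            for ep in self.retrieve_similar(task)
        )
```

Usage is just `mem.add(episode)` after each attempt and `mem.build_context(new_task)` before the next one; because nothing is rewritten, old evidence cannot drift the way consolidated lessons do.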
----
Paper Link: arxiv.org/abs/2605.12978
Paper Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"
