Nicholas Tomlin of NYU CDS releases a 10-task benchmark showing LLMs fail as user simulators because they lack human-like forgetting

QUOTE POST

As Nick says, we’re excited about the potential for leveraging cognitive science to improve user simulators, then use them to evaluate and train models that collaborate better with real humans.

I'm hiring (including but not limited to postdocs), let me know if interested in this direction!

9:00 PM · May 27, 2026 · 2.8K Views

POST

Nicholas Tomlin@NickATomlin

New paper! LLM memory keeps improving, but this makes them *worse* as user sims. If we want to build models that can, e.g., simulate realistic students to train chatbots to be better teachers, then these models need to be able to forget like humans do

📄: https://arxiv.org/abs/2605.25680

Title page of our paper, "Simulating Human Memory with Language Models"

5:54 PM · May 27, 2026 · 14K Views

REPLY

Nicholas Tomlin@NickATomlin

We found that across tasks, language models perform at ceiling (for example, remembering lists of 20 digits perfectly, without any errors), even when prompted to behave like humans with limited working memory. This trend holds for a variety of models and prompting strategies

A figure showing that models (Qwen3-8B and GPT-5.4) perform much better than humans across a range of 10 memory tasks

Nicholas Tomlin@NickATomlin

To compare humans and language models, we built a suite of 10 memory tasks, ranging from classic working memory tests (“remember this list of numbers”) to more open-ended tasks (“study this map and answer questions about it”)

5:54 PM · May 27, 2026 · 555 Views

5:54 PM · May 27, 2026 · 395 Views

REPLY

Nicholas Tomlin@NickATomlin

Since prompting alone isn’t enough to simulate human memory, we also introduce an approach called COMPACTOR, where an LLM agent writes to a key-value memory store. We find that this leads to more human-like memory behavior:

Pseudocode for our COMPACTOR model, which puts information in a key-value memory store

Results figure comparing three prompting strategies with our COMPACTOR model. The COMPACTOR model achieves higher human-likeness scores
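The pseudocode figure is not reproduced in this text, so the following is only a rough sketch of what writing to a key-value memory store could look like, not the paper's COMPACTOR algorithm. Everything here is an assumption made for illustration: the names `KeyValueMemory`, `study`, `store_by_position`, and `recall_digits` are hypothetical, the capacity limit with oldest-first eviction is a stand-in for forgetting, and a plain function takes the place of the LLM agent that decides what to write.

```python
from collections import OrderedDict
from typing import Callable, List, Optional, Sequence, Tuple

Entry = Optional[Tuple[str, str]]


class KeyValueMemory:
    """Key-value store that holds at most `capacity` entries.

    ASSUMPTION: the capacity limit and oldest-first eviction are illustrative
    choices for this sketch, not details taken from the paper.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def write(self, key: str, value: str) -> None:
        # Rewriting a key refreshes it, so it becomes the newest entry.
        self._items.pop(key, None)
        self._items[key] = value
        # Evict the oldest entries once the store is over capacity.
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def read(self, key: str) -> Optional[str]:
        # A missing key models a forgotten item and comes back as None.
        return self._items.get(key)


def study(
    items: Sequence[str],
    choose_entry: Callable[[int, str], Entry],
    memory: KeyValueMemory,
) -> None:
    """Show each item to the writer and store whatever it chooses to keep."""
    for index, item in enumerate(items):
        entry = choose_entry(index, item)
        if entry is not None:  # the writer may decide to store nothing
            key, value = entry
            memory.write(key, value)


def store_by_position(index: int, item: str) -> Entry:
    # HYPOTHETICAL stand-in for the LLM agent: keep every item, keyed by position.
    return (f"pos_{index}", item)


def recall_digits(digits: Sequence[str], capacity: int) -> List[Optional[str]]:
    """Study a digit list, then try to recall every position from memory."""
    memory = KeyValueMemory(capacity)
    study(digits, store_by_position, memory)
    return [memory.read(f"pos_{i}") for i in range(len(digits))]
```

Under these assumptions, `recall_digits(list("31415926"), capacity=4)` returns `None` for the first four positions and the digits `'5'`, `'9'`, `'2'`, `'6'` for the last four, so recall degrades with list length instead of staying at ceiling.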

5:54 PM · May 27, 2026 · 210 Views

REPLY

Nicholas Tomlin@NickATomlin

Finally, we show preliminary evidence that user simulators with more human-like memory are more useful. In particular, we find that our most human-like model is more capable of predicting which LLM outputs humans will best understand and remember:

Results figure from our educational document reranking experiment. See the paper for more details.

5:54 PM · May 27, 2026 · 263 Views

REPLY

Nicholas Tomlin@NickATomlin

This work was done with Qihan Wang, @michahu8, Brian Dillon, and @tallinzen! We’re excited about the continued potential for leveraging ideas from cogsci/linguistics and using them to improve user sims, which can be used to train models that collaborate better with real humans

5:54 PM · May 27, 2026 · 235 Views
