MINTEval Benchmark Tests LLM Agents On Dynamic Long Contexts

Original post

LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this?

We introduce MINTEval:
✅ Frequent context changes & interference (avg. 86 updates)
✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context
✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits
✅ Avg. 138.8k tokens per instance (up to 1.8M)
✅ Human verification on generated QAs = 95.6%

📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%.

🔎 Our analysis shows:
• Memory construction failures cause a 41.7% drop
• Memory agents are highly sensitive to design choices
• Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update

9:48 AM · May 20, 2026

Reposted by

#245 @MOHITBAN47

QUOTE POST

#1438 Elias Stengel-Eskin @ELIASESKIN

🚨 Excited to share MINTEval, a new benchmark for memory with interference. In real-world settings, agents need to handle continuously changing info (think of all your v2.5_final_final docs).

MINTEval tests memory systems on frequent and interfering changes, across challenging question types (long-range lookback/recover, multi-target reasoning) and 4 realistic domains that challenge even the strongest models/agentic memory systems.

🧵👇

hyunji amy lee @hyunji_amy_lee

4:48 PM · May 20, 2026 · 6.4K Views

7:09 PM · May 20, 2026 · 495 Views