🤔It is time to rethink how we evaluate agent memory
🌍 As agents become longer horizon and more autonomous, memory is no longer just a module for storing past chats.
🛠️ It determines how agents track changing worlds, learn from past actions, revise outdated information, and reuse experience for future decisions.
🔍 This raises three key questions:
Are human designed write store retrieve memory pipelines still the best choice? If harnesses such as Codex, Claude Code, and OpenClaw already let agents observe, act, call tools, write files, and reorganize context, can memory also be managed by the harness itself? Do current evaluations really cover how agent memory is used in realistic settings? Many benchmarks are still text centric or single modal, with limited pressure from screenshots, GUIs, tool feedback, and environment changes.
❓ Is final QA accuracy enough?
🔥 We present WorldMemArena, a multimodal benchmark for evaluating agent memory through action world interaction.
📌 Key insights:
🧩 Memory is a lifecycle, not a static cache. 📉 Better memory storage does not necessarily lead to better final performance. 🖼️ Multimodal memory remains a major bottleneck for current systems. 🌍 Real agentic trajectories expose the fragility of memory systems. ⚙️ Harness-based memory is more flexible, but still costly and unstable.