UT Austin's Zhangyang Wang and colleagues release AgingBench to measure how deployed AI agents degrade even with frozen weights
Memory compression, interference, revision, and maintenance drive system degradation.
Really interesting new benchmark, AgingBench.
The idea is that deployed agents don't just have day-one capability, they age. Even with frozen weights, the effective state keeps changing as the agent compresses history, accumulates similar memories, revises facts, and undergoes maintenance. Reliability becomes a property of the full harness, not the model.
They break aging into four mechanisms (compression, interference, revision, maintenance) and tie each to a specific stage of the memory pipeline (write, retrieve, utilize, store), so the diagnosis tells you where to fix things.
They found that behavioral compliance and factual accuracy decouple: the agent keeps sounding right while the values quietly disappear, and violation-based monitoring catches nothing. They also found that revision aging looks representational rather than capacity-bound: scaling the model doesn't fix accumulator drift, but a typed JSON sidecar cuts error by ~47%. And Opus-4.7 reasons better but writes lower-fidelity artifacts, so the same surface failure needs a write-stage fix, not a retrieval-stage fix.
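To make the typed JSON sidecar idea concrete, here is a minimal sketch of what one could look like. It is my own illustration, not AgingBench's implementation: the `Sidecar` class, its methods, and the `budget_usd` key are all hypothetical. The point is only that a revised value overwrites a typed field instead of getting folded into prose the agent has to re-read and re-sum later.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict


@dataclass
class Sidecar:
    """Hypothetical typed store kept next to the agent's free-text memory."""
    values: Dict[str, float] = field(default_factory=dict)
    revisions: Dict[str, int] = field(default_factory=dict)

    def set(self, key: str, value: float) -> None:
        # A revision replaces the old value outright; nothing stale lingers.
        self.values[key] = float(value)
        self.revisions[key] = self.revisions.get(key, 0) + 1

    def add(self, key: str, delta: float) -> None:
        # Accumulators are updated arithmetically, not by rewriting a summary.
        self.values[key] = self.values.get(key, 0.0) + float(delta)
        self.revisions[key] = self.revisions.get(key, 0) + 1

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Sidecar":
        data = json.loads(text)
        # Re-coerce types on load so a malformed file fails loudly here.
        return cls(
            values={k: float(v) for k, v in data["values"].items()},
            revisions={k: int(v) for k, v in data["revisions"].items()},
        )


sidecar = Sidecar()
sidecar.set("budget_usd", 1200)   # initial fact
sidecar.add("budget_usd", -150)   # running update
sidecar.set("budget_usd", 900)    # later revision overrides both
restored = Sidecar.from_json(sidecar.to_json())
```

After the round trip, `restored.values["budget_usd"]` is the revised 900.0 rather than anything derived from the earlier entries, which is the property free-text summaries tend to lose.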
I like this framing because aging is a runtime control problem. The agent is constantly deciding what to write, compress, retrieve, and flush, and current systems make these decisions implicitly. Treating memory policy as a control policy, and aging curves as its closed-loop evaluation, feels like the right frame.
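Here is a toy sketch of that framing, with the memory policy as a swappable function and the aging curve as the score you get from running it in the loop. Everything in it is an assumption for illustration: `keep_everything`, `make_bounded`, `aging_curve`, and the event list are hypothetical, and the accuracy measure is a stand-in, not the metric AgingBench reports.

```python
from typing import Callable, Dict, List, Tuple

Memory = Dict[str, int]
Policy = Callable[[Memory, str, int], Memory]


def keep_everything(memory: Memory, key: str, value: int) -> Memory:
    """Baseline policy: write every fact, never flush."""
    updated = dict(memory)
    updated[key] = value
    return updated


def make_bounded(capacity: int) -> Policy:
    """Policy that flushes the least recently written fact when full."""
    def policy(memory: Memory, key: str, value: int) -> Memory:
        updated = dict(memory)
        updated.pop(key, None)      # re-insert so this key becomes the newest
        updated[key] = value
        while len(updated) > capacity:
            oldest = next(iter(updated))  # dicts preserve insertion order
            del updated[oldest]
        return updated
    return policy


def aging_curve(policy: Policy, events: List[Tuple[str, int]]) -> List[float]:
    """Fraction of ground-truth facts the memory still holds after each event."""
    memory: Memory = {}
    truth: Memory = {}
    curve: List[float] = []
    for key, value in events:
        truth[key] = value                    # the world changes
        memory = policy(memory, key, value)   # the policy decides what to keep
        correct = sum(1 for k, v in truth.items() if memory.get(k) == v)
        curve.append(correct / len(truth))
    return curve


# Hypothetical event stream; ("a", 4) is a revision of an earlier fact.
events = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5)]
full_curve = aging_curve(keep_everything, events)
bounded_curve = aging_curve(make_bounded(2), events)
```

Comparing `full_curve` against `bounded_curve` on the same event stream is the closed-loop part: the policy is judged by how accuracy holds up over time, not by a one-shot score.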
Cool work!