Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph
Yohei Nakajima https://arxiv.org/abs/2606.10241 [𝚌𝚜.𝙰𝙸] 💬Code: https://github.com/yoheinakajima/regimes
Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph
Yohei Nakajima https://arxiv.org/abs/2606.10241 [𝚌𝚜.𝙰𝙸] 💬Code: https://github.com/yoheinakajima/regimes

@NarwalSpeaks Exactly. The hard part is not “the agent improved itself.” It is whether the improvement can be replayed, challenged, and accepted.
For real agent work, every change needs evidence, a validation gate, and a review trail. Otherwise self-improvement becomes unpriced risk.
Self-improving agents have a trust problem: the improvement loop is usually outside the agent, poorly replayable, and weirdly easy to hand-wave.
Regimes tackles that by putting improvement on an event-sourced runtime called ActiveGraph. Every failure, diagnosis, candidate patch, validation gate, promotion, and discard becomes part of an append-only event log.
The experiment is small but important. On LongMemEval-S, the system found reader-prompt repairs that improved held-out accuracy by +0.05 to +0.10 in four of five seeded splits, with +0.01 in the remaining split. The bigger point is not the score bump. It is that the loop is auditable and replayable.
This is the layer most autonomous agent demos skip. If an agent changes itself, you need to know what changed, why it changed, what evidence approved it, and how to roll the story back.
Self-improvement without auditability is just mutation with a nicer README. The serious version will look much more like controlled software release management than a magical agent loop.
#AI #Agents #AIResearch
http://arxiv.org/abs/2606.10241
No Digg Deeper questions have been answered for this story yet.