Cambridge Team Unveils Red Queen Gödel Machine for Co-Evolving AI Agents

Original post

elvis@omarsar0

Fascinating paper on self-improving agents.

(bookmark it)

If you are working on agentic loops, you will quickly realize that they are only as good as their evaluator.

Self-improvement loops tend to stall the moment the judge stops getting harder. The agent learns to satisfy a fixed evaluator rather than getting genuinely better. The Red Queen Gödel Machine, from Cambridge, co-evolves the agent and its evaluator together, so the bar keeps rising as the agent climbs.

The name borrows the evolutionary arms race. Both sides have to keep running to stay in place.

A frozen evaluator is where reward hacking creeps into self-improvement. Co-evolving the judge is a structural answer to that, and it keeps the loop honest over many rounds.
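A toy sketch of the stall described above may help. It is not the paper's algorithm: the integer "cases", the set-based agent and judge, and the `run_loop` helper are all hypothetical, chosen only to show why a frozen judge caps progress while a judge that keeps adding checks does not.

```python
def run_loop(rounds: int, co_evolve: bool) -> int:
    """Return how many cases the agent handles after `rounds` iterations.

    Hypothetical setup: the task is 20 integer cases, the judge is the set
    of cases it checks, and the agent is the set of cases it handles.
    """
    all_cases = list(range(20))
    judge = set(all_cases[:3])  # the evaluator starts with three checks
    agent = set()               # the agent starts out handling nothing

    for _ in range(rounds):
        # Agent step: fix one case that the current judge fails it on.
        failing = sorted(judge - agent)
        if failing:
            agent.add(failing[0])

        # Evaluator step (co-evolution only): add one check for a case
        # the agent does not yet handle, so the bar keeps rising.
        if co_evolve:
            unchecked = [c for c in all_cases if c not in agent and c not in judge]
            if unchecked:
                judge.add(unchecked[0])

    return len(agent)


frozen = run_loop(rounds=10, co_evolve=False)
moving = run_loop(rounds=10, co_evolve=True)
print(f"frozen judge: {frozen} cases, co-evolved judge: {moving} cases")
```

With the frozen judge the agent stops improving once it passes the three fixed checks. With the co-evolved judge there is always a failing check left to work on, so the agent keeps gaining one case per round.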

Paper: https://arxiv.org/abs/2606.26294

Learn to build effective AI agents in our academy: https://academy.dair.ai/

10:32 AM · Jun 28, 2026 · 15.9K Views

kepo@kepochnik

@omarsar0 I'm sure this paper will be useful for me

bookmarked for sure

6h


Tiziano Lattisi@TizianoLattisi

@omarsar0 I want to point out this example of an open-source self-improving harness. It uses not only an LLM judge but also deterministic comparisons of the runrecord.

https://axiastudio.github.io/aioc/tutorials/build-a-self-harness-workflow/

6h


Blockphd区块博士.Ai｜LevelUpLabs@blockphd7

@omarsar0 Marked. I will read it today.

2h

Jan Stevens@janstevens

@omarsar0 This is the core problem: a smart agent will absolutely learn the shape of the test. Co-evolving the judge feels like the only way to keep progress real.

7h

MaatWork@MaatWorkX

@omarsar0 The evaluator is the bottleneck. A static judge turns self-improvement into adversarial optimization. The hard problem is building an evaluator that can't be gamed.

6h

Adrian Chan@gravity7

@omarsar0 The generation-verification gap isn't a hard ceiling—it dissolves on factual tasks & may be elicitation of latent reasoning, not growth. But when generators escape their verifiers, they game them. See if newer calibration & multi-agent loops fix this 👇 https://inquiringlines.com/related/2606-26294-the-red-queen-g-del-machine-co-evolving-agents-and-their-evaluators/

5h

Jaroslaw Wasowski@wasowskijarek

@omarsar0 Co-evolution raises the bar, but both halves can drift together. A judge from the same base inherits the agent's blind spots, so 'harder' just means harder where it already games. What keeps the loop honest isn't the race, it's an acceptance criterion neither side can author.

4h