This is something I have been thinking about after that @karpathy post on LLM Knowledge Bases. Fine-tuning models for maintaining better agent skills, memory, context engineering, routing efficiency, and knowledge bases is going to be huge.
You might also find this read interesting too:
Very good advice on self-improving agents.
(bookmark it)
This is something I am seeing in my own experiments with coding agents and harnesses for long-horizon tasks.
What I have found is that stronger models do not always evolve better agents.
The current believe in self-evolving agents is that a bigger model writes better prompt and skill edits, so devs put their best model in the evolver seat.
New research shows that intuition is mostly wrong.
The work separates two abilities that usually get conflated. Producing harness updates stays flat across model capability, so Qwen3.5-9B writes edits roughly as good as Claude Opus 4.6. Benefiting from those updates follows an inverted-U that peaks at mid-tier models, while weak models fail to even activate the edits and strong models have little headroom left.
This is important to understand as it tells you where to spend. Put a cheap model on the evolver and your expensive model on the solver, because the gains land solver-side, not evolver-side.
Paper: https://arxiv.org/abs/2605.30621
Learn to build effective AI agents in our academy: https://academy.dair.ai/