Garry Tan Shares Workflow For Self-Improving AI Agents With Progressive Evals
By evals I mean literally tell the agent: given what we discussed about what we are doing and why and what happened, use three different frontier models to look at inputs and outputs of your skill file calling the code, and rate it on effectiveness. Why isn’t it a 10? How could it be made to be so?
Run this a few times and you will be surprised how fast it gets astonishingly better.
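A rough sketch of what such a multi-judge eval round could look like in code. Everything here is an assumption for illustration: the `Verdict` type, the `make_stub_judge` helper, and the toy scoring heuristic are hypothetical stand-ins, and in practice each judge would be a call to a different frontier model rather than a local function.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Verdict:
    judge: str
    score: int      # effectiveness rating from 1 to 10
    critique: str   # the judge's answer to "why isn't it a 10?"


# A judge receives (context, skill_input, skill_output) and returns a Verdict.
Judge = Callable[[str, str, str], Verdict]


def make_stub_judge(name: str, strictness: int) -> Judge:
    """Hypothetical stand-in for a real frontier-model call."""
    def judge(context: str, skill_input: str, skill_output: str) -> Verdict:
        # Toy heuristic only: penalize outputs that never mention the input.
        score = 10 - strictness
        if skill_input.lower() not in skill_output.lower():
            score -= 3
        score = max(1, min(10, score))
        critique = "" if score == 10 else f"{name}: address the input more directly"
        return Verdict(name, score, critique)
    return judge


def run_panel(judges: List[Judge], context: str,
              skill_input: str, skill_output: str) -> List[Verdict]:
    """Ask every judge to rate the same input/output pair."""
    return [j(context, skill_input, skill_output) for j in judges]


def summarize(verdicts: List[Verdict]) -> Dict[str, object]:
    """Aggregate scores and collect critiques from judges that scored below 10."""
    return {
        "mean_score": mean(v.score for v in verdicts),
        "min_score": min(v.score for v in verdicts),
        "critiques": [v.critique for v in verdicts if v.score < 10],
    }


if __name__ == "__main__":
    panel = [make_stub_judge("judge_a", 0),
             make_stub_judge("judge_b", 1),
             make_stub_judge("judge_c", 2)]
    verdicts = run_panel(panel, "what we are doing and why",
                         "refund", "Process the refund within 5 days")
    print(summarize(verdicts))
```

The collected critiques are what would be handed back to the agent as the starting point for the next round of improvements.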
And since it is in a skill file plus code with evals (LLM as judge) and unit tests, it stays better forever.
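One way to read "stays better" is as a regression gate: judge scores from an earlier round are stored as a baseline, and a unit test fails if a later change drops any case below it. The sketch below assumes that interpretation. The function name `check_no_regression`, the case names, the scores, and the tolerance are all hypothetical.

```python
import unittest
from typing import Dict, List


def check_no_regression(baseline: Dict[str, float],
                        current: Dict[str, float],
                        tolerance: float = 0.5) -> List[str]:
    """Return eval cases that are missing or scored more than `tolerance` below baseline."""
    regressions = []
    for case, old_score in baseline.items():
        new_score = current.get(case)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(case)
    return sorted(regressions)


class SkillRegressionTest(unittest.TestCase):
    # Hypothetical numbers; real ones would come from the stored judge scores.
    BASELINE = {"refund_email": 8.0, "bug_triage": 7.5}
    CURRENT = {"refund_email": 8.5, "bug_triage": 7.2}

    def test_scores_do_not_regress(self):
        self.assertEqual(check_no_regression(self.BASELINE, self.CURRENT), [])
```

The tolerance is there because LLM-as-judge scores tend to vary between runs, so a small dip should not necessarily fail the build.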
Funny how simple using openclaw and Hermes agent is these days. Just have it do stuff. Then improve in progressive batches with evals from multiple frontier models. It self-improves!