
Engineer Details Two Eval Suites For Robust General Agent Testing


Original post

I've been thinking a lot about the two different groups of evals you need for general agents (agents that handle broad tasks):

1. Benchmark evals - a suite of up to 100 eval cases that test the happy paths of your agent and its most common use cases. It isn't comprehensive, but it covers enough that you can use it to quickly judge how well your agent handles tasks.
2. Test coverage evals - a much more detailed suite (maybe up to 500 or more individual cases) that covers every single task you want your agent to be able to handle. It doesn't just include a single test per task, but multiple tests per use case, all with slightly different user prompting/trajectories.

There need to be two suites for a few reasons:

- General agents have so many use cases that, to test them accurately and have confidence the agent performs well on everything you want to support, you need many evals for each workflow.
- The comprehensive eval suite will become too expensive to run on any sort of recurring basis (let alone in CI): think $1000s per run, especially if you're supporting multiple models. So you need a smaller suite (the benchmark evals) to quickly gauge whether or not your agent still works after code changes.
- General agents can perform the same task via very different paths. The final result is all the user cares about, but the intermediate steps can look very different. If your eval suite doesn't cover multiple paths to the same result, you can't be confident your agent will actually work well in all the real-world scenarios your users put it into.

There's a lot more nuance here, so maybe I'll write a longer blog post on it, and on how we're thinking about building/maintaining eval suites this large...
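A minimal sketch of how the two suites could share one pool of cases, assuming a hypothetical `EvalCase` record with a `benchmark` flag marking the happy-path case for each use case. Nothing here comes from the post itself: `EvalCase`, `select_suite`, `run_suite`, `estimate_cost`, the toy agent, and the dollar figures are all illustrative assumptions. The checks grade only the final output, never the intermediate steps, and the coverage suite holds several prompt variants per use case.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval case; field names are assumptions for illustration."""
    case_id: str
    use_case: str                  # the workflow this case exercises
    prompt: str                    # one phrasing / trajectory variant
    check: Callable[[str], bool]   # grades the final result only
    benchmark: bool = False        # True for the happy-path case of a use case


def select_suite(cases: List[EvalCase], suite: str) -> List[EvalCase]:
    """'benchmark' keeps only flagged happy-path cases; 'coverage' keeps all."""
    if suite == "benchmark":
        return [c for c in cases if c.benchmark]
    if suite == "coverage":
        return list(cases)
    raise ValueError(f"unknown suite: {suite!r}")


def run_suite(cases: List[EvalCase], agent: Callable[[str], str]) -> Dict[str, float]:
    """Return the pass rate per use case, judging only the agent's final output."""
    outcomes: Dict[str, List[bool]] = {}
    for case in cases:
        outcomes.setdefault(case.use_case, []).append(case.check(agent(case.prompt)))
    return {uc: sum(passed) / len(passed) for uc, passed in outcomes.items()}


def estimate_cost(num_cases: int, cost_per_case: float, num_models: int) -> float:
    """Rough cost of one full run: every case is executed once per model."""
    return num_cases * cost_per_case * num_models


def toy_agent(prompt: str) -> str:
    """Stand-in agent (an assumption): it only understands literal keywords."""
    text = prompt.lower()
    if "refund" in text:
        return "refund issued"
    if "summar" in text:
        return "summary: ..."
    return "unknown"


CASES = [
    EvalCase("refund-1", "refund", "Please refund order 123",
             lambda out: "refund issued" in out, benchmark=True),
    # Same use case, different phrasing: only the coverage suite includes it.
    EvalCase("refund-2", "refund", "I want my money back for order 123",
             lambda out: "refund issued" in out),
    EvalCase("summary-1", "summarize", "Summarize this document",
             lambda out: "summary" in out, benchmark=True),
]

if __name__ == "__main__":
    print("benchmark:", run_suite(select_suite(CASES, "benchmark"), toy_agent))
    print("coverage: ", run_suite(select_suite(CASES, "coverage"), toy_agent))
    # Illustrative numbers only: 500 cases at an assumed $1 each across 3 models.
    print("full run cost ($):", estimate_cost(500, 1.0, 3))
```

In this toy setup the benchmark suite passes every use case, while the coverage suite exposes the rephrased refund request that the agent misses. That is the gap the second suite is meant to catch, and the cost estimate shows why it would not be run on every code change.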

12:45 PM · May 19, 2026
