AI Engineer Recommends Incremental Evals For Long-Running Agents

Original post

when evaluating long-running agents, not all of your evals need to be end-to-end. i'm working on a proper blog post about this, but for our agents that run for 30-60 minutes, we have two sets of evals. the first is end-to-end: provide inputs and run llm-as-a-judge over the outputs. the second is incremental: the end-to-end flow has probably 4-5 incremental steps and/or decisions that dictate how the agent performs, and these evals target those. we write both sets of evals! one is for confidence in our overall system, and the other is for confidence in the reproducibility of our agent behavior.

11:59 AM · May 18, 2026
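A rough sketch of what the two eval sets described in the post could look like, not the author's actual setup. Everything here is hypothetical: the toy agent steps (`plan`, `pick_tool`, `write_answer`), the `StepCase` type, and `stub_judge`, which stands in for a real LLM-as-a-judge call with a simple substring check. The incremental eval reruns a single step on a fixed input several times and reports how often it lands on the expected decision, which is one way to read "reproducibility".

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy agent: each step maps a state dict to a new state dict.
# A real agent would call a model here; these stand-ins are deterministic.
def plan(state: dict) -> dict:
    return {**state, "plan": ["search", "summarize"]}

def pick_tool(state: dict) -> dict:
    return {**state, "tool": state["plan"][0]}

def write_answer(state: dict) -> dict:
    return {**state, "answer": f"Used {state['tool']} for: {state['task']}"}

STEPS = [plan, pick_tool, write_answer]

def run_agent(task: str) -> dict:
    """Run every step in order, as the full long-running flow would."""
    state = {"task": task}
    for step in STEPS:
        state = step(state)
    return state

def stub_judge(task: str, answer: str) -> bool:
    """Assumption: placeholder for an LLM-as-a-judge call (substring check)."""
    return task in answer

def eval_end_to_end(tasks: list[str]) -> float:
    """Set 1: feed inputs through the whole agent and judge only the output."""
    passed = sum(stub_judge(t, run_agent(t)["answer"]) for t in tasks)
    return passed / len(tasks)

@dataclass
class StepCase:
    step: Callable[[dict], dict]  # the single step under test
    state_in: dict                # fixed input state for that step
    key: str                      # which field of the output holds the decision
    expected: object              # the decision we expect to see

def eval_incremental(cases: list[StepCase], repeats: int = 5) -> dict[str, float]:
    """Set 2: rerun one step on a fixed input and measure how often it agrees."""
    rates = {}
    for case in cases:
        hits = sum(
            case.step(dict(case.state_in))[case.key] == case.expected
            for _ in range(repeats)
        )
        rates[case.step.__name__] = hits / repeats
    return rates
```

With a real model behind each step, the end-to-end score says whether the whole run produced an acceptable result, while a per-step rate below 1.0 points at the specific decision that is unstable, without paying for a full 30-60 minute run to find it.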