
Engineer Condenses Long-Horizon Agent Evaluations Into Smaller Targeted Subsets

Original post

you can condense long-horizon agent evals into smaller subsets that still test the intended behavior. i'm currently evaluating an agent that runs for 30+ minutes and analyzes thousands of traces at a time. here's my process: if you're evaluating whether X impacts agent output Y, a lot of the information in X often isn't relevant to the decision behind Y. i extract the reasoning from the trace and figure out what caused a specific behavior. then i know what situation i need to re-create when setting up my eval. as a result, i can create a much smaller, simpler version of the long-horizon eval and use it to quickly figure out what i need to change in my prompts to get the behavior i want.
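The process described in the post could be sketched roughly as follows. This is a minimal illustration and not the author's actual tooling: the `Step` and `MiniCase` types, the keyword-based `cue` match, the toy trace, and the two stub agents are all assumptions made up for the example. In practice, finding the cause of a behavior would likely mean reading the reasoning by hand or with a model, and `decide` would be a single short model call using the prompt under test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    reasoning: str  # the agent's stated reasoning at this step (hypothetical field)
    context: str    # what the agent was looking at when it decided
    action: str     # the decision the agent made


@dataclass
class MiniCase:
    context: str          # the re-created situation, and nothing else
    expected_action: str  # the behavior we want from the agent


def find_cause_steps(trace: List[Step], behavior: str, cue: str) -> List[Step]:
    """Keep steps that show the behavior and whose reasoning mentions the cue."""
    return [
        s for s in trace
        if s.action == behavior and cue.lower() in s.reasoning.lower()
    ]


def condense(trace: List[Step], behavior: str, cue: str,
             expected_action: str) -> List[MiniCase]:
    """Turn each causal step into a small, standalone eval case."""
    return [
        MiniCase(s.context, expected_action)
        for s in find_cause_steps(trace, behavior, cue)
    ]


def pass_rate(cases: List[MiniCase], decide: Callable[[str], str]) -> float:
    """Fraction of cases where the agent's decision matches the expected one."""
    if not cases:
        return 0.0
    return sum(decide(c.context) == c.expected_action for c in cases) / len(cases)


# Toy stand-in for a long trace: most steps are irrelevant to the behavior.
TRACE = [
    Step("listing traces to review", "1,204 traces loaded", "continue"),
    Step("trace has a timeout, so it is probably noise",
         "trace 17: tool call timed out after 30s", "skip"),
    Step("trace looks healthy", "trace 18: completed in 2s", "continue"),
    Step("timeout again, treating it as noise",
         "trace 44: tool call timed out after 30s", "skip"),
]

# Unwanted behavior: the agent skips timed-out traces. We want it to flag them.
cases = condense(TRACE, behavior="skip", cue="timeout", expected_action="flag")


def old_prompt_agent(context: str) -> str:
    # Stub for the agent under the current prompt (assumption for the demo).
    return "skip" if "timed out" in context else "continue"


def new_prompt_agent(context: str) -> str:
    # Stub for the agent after a prompt change (assumption for the demo).
    return "flag" if "timed out" in context else "continue"


if __name__ == "__main__":
    print(len(cases), "condensed cases")
    print("old prompt:", pass_rate(cases, old_prompt_agent))
    print("new prompt:", pass_rate(cases, new_prompt_agent))
```

Each condensed case carries only the context that triggered the decision, so checking a prompt change means re-running a handful of short calls instead of the full 30+ minute run.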

1:39 PM · May 20, 2026