Engineer Condenses Long-Horizon Agent Evaluations Into Smaller Targeted Subsets

Original post

you can condense long-horizon agent evals into smaller subsets that still test the intended behavior. i'm currently evaluating an agent that runs for 30+ minutes and analyzes thousands of traces at a time. here's my process: if you're evaluating whether X impacts agent output Y, a lot of the information in X often isn't relevant to the decision behind Y. i extract the reasoning from the trace and figure out what caused the specific behavior. then i know what situation i need to re-create when setting up my eval. as a result, i can create a much smaller, simpler version of the long-horizon eval and use it to quickly figure out what i need to change in my prompts to get the behavior i want.
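A minimal sketch of the condensing step, under loud assumptions: the post does not say how the reasoning is extracted or how traces are stored, so the `Step` record, its `depends_on` links, and the `condense_trace` helper are all hypothetical. The idea illustrated is only this: start from the decision you care about, walk back through the steps its reasoning relied on, and keep just those as the setup for a smaller eval.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    # Hypothetical trace record; real trace formats will differ.
    id: str
    content: str
    # Assumed to be filled in by hand or by a separate analysis pass:
    # ids of earlier steps this step's reasoning actually relied on.
    depends_on: List[str] = field(default_factory=list)


def condense_trace(steps: List[Step], decision_id: str) -> List[Step]:
    """Keep only the steps reachable from the decision via depends_on."""
    by_id = {s.id: s for s in steps}
    keep = set()
    stack = [decision_id]
    while stack:
        sid = stack.pop()
        if sid in keep or sid not in by_id:
            continue
        keep.add(sid)
        stack.extend(by_id[sid].depends_on)
    # Preserve the original ordering of the trace.
    return [s for s in steps if s.id in keep]


# Toy trace (invented for illustration, not from the post).
trace = [
    Step("t1", "tool: list 4,000 traces"),
    Step("t2", "tool: fetch trace 17 (timeout error)"),
    Step("t3", "tool: fetch trace 18 (ok)"),
    Step("r1", "reasoning: trace 17 timed out, so flag the retry policy", ["t2"]),
    Step("d1", "decision: report retry policy as root cause", ["r1"]),
]

mini = condense_trace(trace, "d1")
print([s.id for s in mini])  # ['t2', 'r1', 'd1']
```

The condensed steps can then seed a short eval case that re-creates only the situation behind the decision, so prompt changes can be checked in seconds rather than in a 30+ minute run.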

1:39 PM · May 20, 2026

Reposted by

#739@HWCHASE17