you can condense long-horizon agent evals into much smaller cases that still test the intended behavior.
i'm currently evaluating an agent that runs for 30+ minutes and analyzes thousands of traces at a time.
here's my process: if you're evaluating whether input X changes agent output Y, most of the information in X usually isn't relevant to the decision behind Y.
i extract the reasoning out of the trace and figure out what caused a specific behavior. then i know exactly what situation i need to re-create when setting up my eval.
as a result, i can create a much smaller, simpler version of the long-horizon eval and use it to quickly figure out what i need to change in my prompts to get the behavior i want.
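
a rough sketch of what that condensing step could look like in python. everything here is hypothetical: the `Step` and `EvalCase` shapes, the `condense` helper, and especially the `cites` field, which assumes you've already worked out (by hand or with a model) which earlier steps each piece of reasoning relied on. real traces won't give you that for free.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    id: str
    kind: str  # "observation", "reasoning", or "action"
    text: str
    # assumption: ids of earlier steps this step relied on (annotated by you)
    cites: list = field(default_factory=list)


@dataclass
class EvalCase:
    context: list  # only the observations that drove the decision
    expected: str  # the behavior you want the agent to show


def condense(trace, action_id, expected):
    """Keep only the observations that the target action depends on."""
    by_id = {s.id: s for s in trace}
    stack = list(by_id[action_id].cites)
    seen = set()
    needed = set()
    # walk backwards from the action through everything it cites
    while stack:
        sid = stack.pop()
        if sid in seen:
            continue
        seen.add(sid)
        step = by_id[sid]
        if step.kind == "observation":
            needed.add(sid)
        stack.extend(step.cites)
    # preserve the original trace order in the condensed context
    context = [s.text for s in trace if s.id in needed]
    return EvalCase(context=context, expected=expected)


# toy trace: three observations, only two of which mattered
trace = [
    Step("o1", "observation", "trace 17: tool call timed out"),
    Step("o2", "observation", "trace 18: user asked about billing"),
    Step("o3", "observation", "trace 19: tool call timed out"),
    Step("r1", "reasoning", "two timeouts look like a pattern", cites=["o1", "o3"]),
    Step("a1", "action", "flag timeouts as the top issue", cites=["r1"]),
]

case = condense(trace, "a1", expected="flag timeouts as the top issue")
print(case.context)
```

the condensed case drops the billing observation because nothing on the path to the action cited it. you can then run your prompt variants against `case.context` alone instead of re-running the full 30+ minute job.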