Trajectory-based error analysis points to levers for post-training and harness engineering!
From the @harvey team:
- Verify-and-revise correlates with the biggest score jump (+1.5).
- "Fan-out" tool parallelism hurts (-0.5), potentially adding noise without direction.
- Grounding drafts against source evidence is associated with +0.3, but occurs in only 19% of trajectories.
Excited for more behavior-level analysis of long-horizon agent evals - great example here from the Legal Agent Benchmark (LAB)!