there are long-horizon agent evals, and there are agent-to-agent evals.
i wanted one that tested both.
if agents are going to negotiate, sell, schedule, and coordinate with each other in the real world, we need evals where they interact over long horizons with state, tools, and measurable outcomes.
so naturally, i made them cold-call insurance leads lol (shoutout Juliano Massarelli).
i present SalesBench: a seller agent works a pipeline, calls an LLM buyer, manages time/tools/state, and gets scored by revenue closed.
i trained a 2B model on it and ran early sweeps across frontier models.
huge thanks to the Prime Intellect team and everyone who helped along the way @johannes_hage @willccbb @vincentweisser @GottliebEli @omouamoua @Ameen_ml @DennwsLee @OmShastri123
full breakdown:
https://hamzamostafa.com/blog/salesbench