Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD.
Most benchmarks evaluate tool-using agents with a single aggregate success rate, but that number can’t explain why a model fails. ToolFailBench is a diagnostic benchmark that scores tool use against a failure taxonomy instead, checking each trace for four distinct failure modes: skipping a tool that was needed, ignoring what a tool returns, fabricating tool outputs, and over-calling tools when none is needed. We find that models with similar aggregate scores fail in very different ways, so the aggregate alone isn’t enough to compare agents.
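
To make the idea concrete, here is a minimal sketch of per-mode scoring. It is not ToolFailBench's actual code or data: the `FailureMode` enum, the `failure_profile` function, and the two toy models are hypothetical, and I'm assuming each trace has already been labeled with the set of failure modes it shows (an empty set means success).

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Hypothetical names for the four failure modes described in the post.
    SKIPPED_TOOL = "skipped a tool that was needed"
    IGNORED_OUTPUT = "ignored what a tool returned"
    FABRICATED_OUTPUT = "fabricated a tool output"
    OVER_CALLED = "called a tool when none was needed"


def failure_profile(labels):
    """Return (success_rate, per-mode failure rates) for a list of traces.

    `labels` holds one set of FailureMode values per trace; an empty set
    marks a successful trace. This labeling format is an assumption.
    """
    if not labels:
        raise ValueError("need at least one labeled trace")
    n = len(labels)
    counts = Counter(mode for modes in labels for mode in modes)
    success_rate = sum(1 for modes in labels if not modes) / n
    return success_rate, {mode: counts[mode] / n for mode in FailureMode}


# Made-up labels for two toy models, four traces each.
model_a = [set(), set(), {FailureMode.SKIPPED_TOOL}, {FailureMode.SKIPPED_TOOL}]
model_b = [set(), set(), {FailureMode.FABRICATED_OUTPUT}, {FailureMode.OVER_CALLED}]

for name, labels in [("A", model_a), ("B", model_b)]:
    success, profile = failure_profile(labels)
    print(f"model {name}: success={success:.2f}")
    for mode, rate in profile.items():
        print(f"  {mode.value}: {rate:.2f}")
```

Both toy models succeed on half their traces, but A only ever skips needed tools while B splits its failures between fabricating outputs and over-calling. That difference is invisible in the aggregate and obvious in the per-mode profile.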