This paper shows that LLM agents still struggle to plan through big, messy tool libraries.
The paper builds a retail benchmark, PlanBench-XL, to test whether LLM agents can solve long tool-use tasks when tools are hard to find.
It has 327 tasks and 1,665 tools, and agents must uncover hidden intermediate facts before they can answer.
Even strong models struggle: GPT-5.4 reaches 51.90% accuracy in the normal setting and drops to 11.36% in the hardest blocked setting.
The problem is that real agents often face huge tool libraries, so they cannot see every tool at once and must search for useful ones while solving the task.
The core idea is to make agents plan both forward from what they know and backward from what they need, instead of giving them a clear tool path.
The authors also add broken or misleading tools, so agents must notice when a promising path fails and then find another path.
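
To make the forward/backward idea concrete, here is a toy sketch in Python. It is not the paper's method or the benchmark's API: the `Tool` class, the tool names, and the `broken` flag (which stands in for a tool call that fails at run time) are all made up for illustration. The backward pass collects the facts that could lead to the goal, the forward pass only calls tools that produce those facts, and a failed tool is set aside so another route can be tried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical tool description, not the benchmark's schema.
    name: str
    needs: frozenset      # facts required before the tool can be called
    gives: str            # fact the tool produces
    broken: bool = False  # simulates a call that fails at run time


def backward_relevant(tools, goal):
    """Backward pass: collect every fact that could help reach the goal."""
    relevant = {goal}
    changed = True
    while changed:
        changed = False
        for t in tools:
            if t.gives in relevant and not t.needs <= relevant:
                relevant |= t.needs
                changed = True
    return relevant


def solve(tools, known, goal):
    """Forward pass restricted to relevant facts, skipping tools that fail."""
    relevant = backward_relevant(tools, goal)
    facts, blocked, plan = set(known), set(), []
    progress = True
    while goal not in facts and progress:
        progress = False
        for t in tools:
            if t.name in blocked or t.gives in facts or t.gives not in relevant:
                continue
            if t.needs <= facts:
                if t.broken:
                    blocked.add(t.name)  # promising path failed, try another
                    continue
                facts.add(t.gives)
                plan.append(t.name)
                progress = True
    return plan if goal in facts else None


tools = [
    Tool("lookup_order", frozenset({"order_id"}), "sku"),
    Tool("price_api_v1", frozenset({"sku"}), "price", broken=True),
    Tool("catalog_search", frozenset({"sku"}), "catalog_entry"),
    Tool("price_from_catalog", frozenset({"catalog_entry"}), "price"),
    Tool("weather", frozenset({"city"}), "forecast"),  # irrelevant distractor
]

print(solve(tools, {"order_id"}, "price"))
```

In this toy run the direct pricing tool fails, so the plan goes through the catalog instead, and the unrelated `weather` tool is never considered because the backward pass never marks its output as useful.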
----
Link – arxiv.org/abs/2606.22388
Title: "PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems"