UIUC's PlanBench-XL benchmark finds LLM agents struggle to plan when navigating large, disorganized tool libraries · Digg