New Agents' Last Exam benchmark finds frontier AI agents score under 10% on realistic expert tasks · Digg