Kalomaze's agent evaluation benchmark finds GPT-5.5 leads on task reward while DeepSeek v4-flash dominates cost-efficiency · Digg