2h ago

Kalomaze's agent evaluation benchmark finds GPT-5.5 leads on task reward while DeepSeek v4-flash dominates cost-efficiency

Claude-opus-4.7 placed second with a 0.832 reward score.