Prime Intellect's Florian Brand argues that benchmark task size is an unreliable proxy for evaluating long-horizon LLM performance · Digg