@scaling01 tasks are different, yes
Cognition made a long time-horizon benchmark that should be good up to ~64 hours
Mythos had 16+ hour time horizons in April, so that's ~2 more doublings (16h → 64h), or ~210 days after Mythos, until the benchmark saturates
meaning the benchmark is cooked before the end of the year
(unless the task distribution is different. we have already seen how time-horizons differ on different coding benchmarks)
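rough sketch of the arithmetic, for anyone who wants to plug in their own numbers. the ~105-day doubling time is an assumption backed out from "~2 doublings or ~210 days", not a measured value, and the function name is just illustrative:

```python
import math

def days_until_saturation(current_hours, ceiling_hours, doubling_days):
    """Days until the time horizon reaches the benchmark ceiling,
    assuming steady exponential growth at a fixed doubling time."""
    doublings = math.log2(ceiling_hours / current_hours)
    return doublings * doubling_days

# 16h horizon in April, ~64h benchmark ceiling.
# doubling_days=105 is an assumed value implied by the ~210-day estimate.
days = days_until_saturation(16, 64, 105)
print(days)  # 210.0
```

a slower doubling time or a higher effective ceiling pushes the date out, which is the same caveat as the task-distribution one.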