We are heads down on a big project right now, but some folks asked how recent open models perform on KellyBench so we updated the leaderboard.
GLM 5.2 is impressive, although our sense is that it is ~6 months behind on these types of quant benchmarks.
(We think SWE benchmarks likely underestimate the gap because of optimisation pressure towards those types of benchmarks.)
We evaluated recent open models on KellyBench.
Here is what we found:
🏆 GLM 5.2 is the new open-source SoTA, but still loses 30% on average over 5 runs.
📈 We estimate GLM 5.2 is 6+ months behind the frontier based on KellyBench and internal quant evaluations. (Note: we have not evaluated Fable.)
🌗 Kimi K2.6 slightly improves on Kimi K2.5 but still struggles at -60% average RoI.
🐈 Recent Mistral models struggle, obtaining mean RoIs of -78% and -99% respectively.
Leaderboard link and more graphs below.
