General Reasoning's Ross Taylor says updated KellyBench shows leading open-source models remain roughly six months behind closed frontier models

Story Overview

Ross Taylor's latest KellyBench refresh tests AI agents on building prediction models, spotting market edges, and applying Kelly-style staking across a full simulated Premier League season, where even the strongest open-weights entry, GLM 5.2, posts roughly -30% mean ROI while closed frontier models sit far ahead.
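For readers unfamiliar with the "Kelly-style staking" mentioned above, here is a minimal sketch of the textbook Kelly criterion for a single binary bet. It is not KellyBench's own code: the function names, the decimal-odds convention, and the `scale` parameter (fractional Kelly) are assumptions made for illustration.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Textbook Kelly stake, as a fraction of bankroll, for one binary bet.

    p_win        -- the bettor's estimated probability that the bet wins
    decimal_odds -- bookmaker decimal odds (total payout per unit staked)
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    b = decimal_odds - 1.0  # net profit per unit staked if the bet wins
    if b <= 0.0:
        raise ValueError("decimal_odds must be greater than 1")
    # f* = (b*p - q) / b, with q = 1 - p; clamp at zero when there is no edge
    edge = b * p_win - (1.0 - p_win)
    return max(0.0, edge / b)


def stake(bankroll: float, p_win: float, decimal_odds: float,
          scale: float = 0.5) -> float:
    """Hypothetical helper: amount to bet using a scaled ("fractional") Kelly rule."""
    return bankroll * scale * kelly_fraction(p_win, decimal_odds)
```

With an estimated 55% win probability at even-money decimal odds of 2.0, `kelly_fraction` returns about 0.10, so a half-Kelly stake on a bankroll of 1,000 is about 50. When the estimated edge is negative, the function returns zero, meaning no bet.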

Original post

Ross Taylor@rosstaylor90

We are heads down on a big project right now, but some folks asked how recent open models perform on KellyBench so we updated the leaderboard.

GLM 5.2 is impressive, although our sense is that it is ~6 months behind on these type of quant benchmarks.

(We think SWE benchmarks likely underestimate the gap because of optimisation pressure towards those types of benchmarks)

General Reasoning@GenReasoning

We evaluated recent open models on KellyBench.

Here is what we found:

🏆 GLM 5.2 is new open source SoTA, but still loses -30% on average over 5 runs.

📈 We estimate GLM 5.2 is 6+ months behind the frontier based on KellyBench and internal quant evaluations. (Note: we have not evaluated Fable)

🌗 Kimi K2.6 slightly improves on Kimi K2.5 but still struggles at -60% average RoI.

🐈 Recent Mistral models struggle, obtaining mean RoIs of -78% and -99% respectively.

Leaderboard link and more graphs below.

3:11 AM · Jun 18, 2026 · 2.1K Views

Open Question

Long-horizon adaptation still favors closed systems

The benchmark's non-stationary environment and multi-seed ruin-avoidance checks surface gaps that static leaderboards often miss. The six-month estimate itself has not been independently confirmed by contemporaneous sources.

Benchmark Limits

Sophistication scores add a human lens to raw numbers

A 52-point expert rubric evaluates strategy quality beyond ROI, yet the paper itself flags possible under-elicitation from single-agent setups and market-efficiency limits that could compress everyone's results.

Related links

Introducing KellyBench (via GR.INC)

Posts from X

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Below Opus 4.6. But it can avoid ruin. Yeah, 6-7 months sounds reasonable.

Ross Taylor@rosstaylor90

@xeophon Yep, on KellyBench alone the difference from frontier is narrower. On our internal quant evals it looks a little worse - hoping we’ll be able to publish on those soon 🤞.

Florian Brand@xeophon

@rosstaylor90 Looks like it’s pretty much on par with 5.4 here?

General Reasoning@GenReasoning

Because of backtest variance, we also record a process-based rubric measure called "sophistication", which we track over time. This uses a human expert rubric to judge the sophistication of the strategies employed.

GLM-5.2 shows impressive sophistication compared to other open models, although it still does not surpass the closed SoTA on this metric for any time this year.

You can see the full leaderboard and more analysis here:

https://www.gr.inc/releases/introducing-kellybench

Florian Brand@xeophon

@rosstaylor90 General Reasoning, the Hedgefund?
