Researchers introduce QuantSightBench for LLM forecasting evaluation

This feels like an incredibly important and promising direction. If next language token prediction was the beginning, what's the next step change in intelligence? Numerical predictions? Amazing work by Maksym and the team.

Maksym Andriushchenko@maksym_andr

💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted?

We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc.

Why forecasting?

Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold.

Beyond producing accurate point forecasts, correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously; rather, they assist human decision-making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer.

Why *numerical* forecasting?

Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter (GDP growth, ARR numbers, election margins, infrastructure timelines) are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates.

So we built a benchmark to measure this! This is joint work with Jeremy Qin @Jjq2221.
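
To make the point about intervals versus point estimates concrete, here is a minimal sketch of one standard way to score a single prediction interval, the interval (Winkler) score. This is an illustration only: the thread does not say which scoring rule QuantSightBench uses, and the function name `interval_score` and the toy numbers are assumptions.

```python
def interval_score(lower: float, upper: float, actual: float, alpha: float) -> float:
    """Interval (Winkler) score for a central (1 - alpha) prediction interval.

    Lower is better. The score is the interval width plus a penalty of
    (2 / alpha) times the distance by which the outcome misses the interval.
    Assumption: this is a generic scoring rule, not necessarily the
    metric used in the QuantSightBench paper.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    width = upper - lower
    miss_below = max(lower - actual, 0.0)  # outcome fell under the interval
    miss_above = max(actual - upper, 0.0)  # outcome fell over the interval
    return width + (2.0 / alpha) * (miss_below + miss_above)


# Hypothetical 90% intervals (alpha = 0.1) for an outcome that resolved to 15.
narrow_miss = interval_score(10.0, 12.0, 15.0, alpha=0.1)  # tight but wrong
wide_hit = interval_score(5.0, 20.0, 15.0, alpha=0.1)      # wide but right
print(narrow_miss, wide_hit)
```

In this toy case the overconfident narrow interval scores far worse than the wide one that contains the outcome, which is the kind of trade-off a point-estimate metric cannot see.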

Maksym Andriushchenko@maksym_andr

Model-level takeaways:
- Grok 4 is genuinely good at forecasting: it beats GPT-5.4 👀
- But Gemini 3.1 Pro is still on top overall
- Open-weight models lag meaningfully behind the frontier on interval calibration: the best proprietary model is Gemini 3.1 Pro (79.1% coverage at the 90% PI), while the best open-weight model is Kimi K2 Thinking (65.8%).

Maksym Andriushchenko@maksym_andr

Main results: frontier models are systematically miscalibrated on their prediction intervals. They undercover at every nominal level we tested.

We test frontier models in an agentic setting on QuantSightBench, and control for temporal leakage, resolution ambiguity, and the other pitfalls flagged by @dpaleka et al. by building on the OpenForecast pipeline from @nikhilchandak29, @ShashwatGoel7 et al. (https://arxiv.org/abs/2512.25070).
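
For readers unfamiliar with the term, "undercovering" means the realized outcomes land inside the stated prediction intervals less often than the nominal level promises. A minimal sketch of how empirical coverage could be computed follows; the function name `empirical_coverage` and the toy data are hypothetical and are not taken from the QuantSightBench code.

```python
from typing import Sequence


def empirical_coverage(
    lower: Sequence[float], upper: Sequence[float], actual: Sequence[float]
) -> float:
    """Fraction of resolved outcomes that fall inside [lower, upper]."""
    if not (len(lower) == len(upper) == len(actual)) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, actual))
    return hits / len(actual)


# Hypothetical 90% prediction intervals for five resolved questions.
nominal_level = 0.90
lower = [1.0, 10.0, 100.0, 0.5, 20.0]
upper = [3.0, 15.0, 150.0, 0.9, 30.0]
actual = [2.0, 16.0, 120.0, 0.95, 25.0]

coverage = empirical_coverage(lower, upper, actual)
gap = nominal_level - coverage  # positive gap means the model undercovers
print(f"coverage={coverage:.2f}, gap={gap:.2f}")
```

A perfectly calibrated forecaster would have coverage close to the nominal level on a large question set; the reported 79.1% at the 90% PI corresponds to a positive gap in this sense.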

Florian Brand@xeophon

@maksym_andr another tübingen banger

Maksym Andriushchenko@maksym_andr

Website: https://quantsightbench.com/ Paper: https://arxiv.org/abs/2604.15859 Code: https://github.com/aisa-group/quantsightbench

Almost everything you see is done by my amazing student @Jjq2221! That's the first paper of his PhD, and we are actively preparing some other cool projects :) stay tuned!

Maksym Andriushchenko@maksym_andr

@xeophon we are actually gonna post another banger benchmark early next week! stay tuned.

Florian Brand@xeophon

@maksym_andr you are treating me too well...

Maksym Andriushchenko@maksym_andr

@xeophon and yes, there we did use native agent harnesses! and still all agents basically suck. it's gonna be a very interesting benchmark. i know i'm teasing too much...

Florian Brand@xeophon

@maksym_andr i do not expect anything else from the PTB + FutureSim ppl tbh

G, MD@DrBeavisAI

@maksym_andr Missing now GPT-5.5 xhigh and pro. Wonder if this is already better than most human experts

Noema@noemaclips

@maksym_andr @stalkermustang @Bayesian0_0 you might want to check this out!

Maksym Andriushchenko@maksym_andr

@DrBeavisAI we ran the experiments for the paper before GPT-5.5 got released. but we do have GPT-5.4 and it's in the top-3!

Igor Kotenkov@stalkermustang

@noemaclips @maksym_andr @Bayesian0_0 I've seen it on my feed, but thanks for flagging!

I was a little bit disappointed the authors didn't test DSv4pro and GPT-5.5, though the latter is understandable
