Researchers introduce QuantSightBench for LLM forecasting evaluation

Jeremy Qin and Maksym Andriushchenko introduced QuantSightBench to measure how accurately frontier LLMs produce numerical forecasts and calibrated prediction intervals. The benchmark scores models on open-ended questions from business and politics using coverage rate, mean log interval score, and mean absolute percentage error. Supporting resources include an arXiv paper, the aisa-group/quantsightbench GitHub repository with evaluation code, and a live leaderboard at quantsightbench.com.
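
For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them. It is a minimal illustration, not the benchmark's actual code: the interval score follows the standard Gneiting-Raftery definition, `mean_log_interval_score` is one plausible reading of "mean log interval score", and all function names are invented rather than taken from the aisa-group/quantsightbench repository.

```python
import numpy as np

def coverage_rate(y, lower, upper):
    """Fraction of resolved outcomes that fall inside the predicted intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

def interval_score(y, lower, upper, alpha=0.1):
    """Gneiting-Raftery interval score for a central (1 - alpha) interval:
    interval width plus a 2/alpha penalty per unit the outcome misses by."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return (
        (upper - lower)
        + (2.0 / alpha) * np.clip(lower - y, 0.0, None)
        + (2.0 / alpha) * np.clip(y - upper, 0.0, None)
    )

def mean_log_interval_score(y, lower, upper, alpha=0.1):
    """One plausible reading of the metric: average the log of per-question
    interval scores to damp heavy-tailed magnitudes across questions."""
    return float(np.mean(np.log(interval_score(y, lower, upper, alpha))))

def mape(y, point):
    """Mean absolute percentage error of the point forecasts."""
    y, point = np.asarray(y, dtype=float), np.asarray(point, dtype=float)
    return float(np.mean(np.abs((y - point) / y)))
```

With arrays of resolved outcomes and per-question interval bounds, `coverage_rate(y, lower, upper)` should sit near 0.9 for well-calibrated 90% intervals; values below that indicate under-coverage.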

Original post

💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted?

We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc.

Why forecasting?

Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold.

Beyond producing accurate point-wise forecasts, having correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously; rather, they assist human decision-making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer.
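
The stop-or-commit behavior described above can be pictured as a simple control loop: keep gathering evidence while the interval is too wide for the decision at hand, then commit. This is a hypothetical sketch of that idea, not the paper's agent; `acquire_evidence` and `revise_interval` are stand-in callables.

```python
def forecast_with_stopping(question, acquire_evidence, revise_interval,
                           max_steps=10, target_rel_width=0.2):
    """Acquire information until the prediction interval is narrow enough
    relative to the point estimate, then stop and commit to an answer."""
    evidence = []
    point, lo, hi = revise_interval(question, evidence)
    for _ in range(max_steps):
        if point and (hi - lo) / abs(point) <= target_rel_width:
            break  # uncertainty is low enough: stop and commit
        evidence.append(acquire_evidence(question, evidence))
        point, lo, hi = revise_interval(question, evidence)
    return point, (lo, hi)
```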

Why *numerical* forecasting?

Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter (GDP growth, ARR numbers, election margins, infrastructure timelines) are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates.

So we built a benchmark to measure this! This is joint work with Jeremy Qin @Jjq2221.

6:02 PM · May 16, 2026 · 3.9K Views

Main results: frontier models are systematically miscalibrated on their prediction intervals. Their intervals under-cover at every nominal level we tested.
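
Concretely, under-coverage means the share of outcomes landing inside a nominal p% interval is below p, and "at every nominal level" means this holds at 50%, 80%, 90%, and so on. The toy check below simulates an overconfident forecaster whose intervals are 30% too narrow; all numbers are synthetic, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=2000)  # stand-in for resolved numerical outcomes

SHRINK = 0.7  # overconfident forecaster: intervals 30% too narrow
for nominal in (0.50, 0.80, 0.90, 0.95):
    tail = (1 - nominal) / 2
    lo, hi = np.quantile(y, [tail, 1 - tail])  # ideally calibrated interval
    lo, hi = SHRINK * lo, SHRINK * hi          # shrunk toward the center
    empirical = np.mean((y >= lo) & (y <= hi))
    print(f"nominal {nominal:.0%} -> empirical {empirical:.1%}")
```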

We test frontier models in an agentic setting on QuantSightBench, controlling for temporal leakage, resolution ambiguity, and the other pitfalls flagged by @dpaleka et al. by building on the OpenForecast pipeline from @nikhilchandak29, @ShashwatGoel7 et al. (https://arxiv.org/abs/2512.25070).
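
One standard control for temporal leakage is to score a model only on questions that resolve after its training cutoff, so the ground truth cannot appear in pretraining data. Whether OpenForecast implements it exactly this way is an assumption; the question schema below is invented for illustration.

```python
from datetime import date

def eligible(question: dict, model_cutoff: date) -> bool:
    """Keep only questions that resolve after the model's training cutoff,
    so the answer cannot have leaked into the model's pretraining data."""
    return question["resolution_date"] > model_cutoff

pool = [
    {"q": "US Q1 2026 GDP growth (%)", "resolution_date": date(2026, 4, 30)},
    {"q": "2024 election margin (pts)", "resolution_date": date(2024, 11, 6)},
]
kept = [q for q in pool if eligible(q, date(2025, 5, 1))]  # drops the 2024 question
```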

Model-level takeaways:

- Grok 4 is genuinely good at forecasting; it beats GPT-5.4 👀
- But Gemini 3.1 Pro is still on top overall
- Open-weight models lag meaningfully behind the frontier on interval calibration: the best proprietary model is Gemini 3.1 Pro (79.1% coverage at the 90% PI), while the best open-weight model is Kimi K2 Thinking (65.8%)

Website: https://quantsightbench.com/
Paper: https://arxiv.org/abs/2604.15859
Code: https://github.com/aisa-group/quantsightbench

Almost everything you see is done by my amazing student @Jjq2221! That's the first paper of his PhD, and we are actively preparing some other cool projects :) stay tuned!
