Arvindh Arun introduces FutureSim, a simulation framework that streams 244,000 real news articles starting January 1, 2026 for agents to forecast events and test continual learning in frontier models


Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎

FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔

We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈 Open weight frontier models have a significant gap to catch up, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%

On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market 💰💸

FutureSim serves as a test bed for evaluating a lot of important agentic capabilities
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute

More cool insights and deep dives in our paper 👇
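The day-by-day replay with date-gated article access described in this post can be pictured with a minimal sketch. This is not FutureSim's actual code or API: the `Article` and `ReplayEnvironment` names, the title-keyword search, and the sample articles are all hypothetical, chosen only to illustrate how an agent's view of the corpus could be restricted to the simulated date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    published: date
    title: str


class ReplayEnvironment:
    """Hypothetical date-gated corpus: only articles up to `today` are visible."""

    def __init__(self, articles, start, horizon_days):
        self._articles = sorted(articles, key=lambda a: a.published)
        self.today = start
        self.end = start + timedelta(days=horizon_days)

    def search(self, keyword):
        # Articles published after the simulated date are never returned.
        needle = keyword.lower()
        return [
            a for a in self._articles
            if a.published <= self.today and needle in a.title.lower()
        ]

    def advance_day(self):
        # The agent decides when to call this; returns False at the horizon.
        if self.today >= self.end:
            return False
        self.today += timedelta(days=1)
        return True


# Made-up sample data for illustration only.
corpus = [
    Article(date(2026, 1, 1), "Playoff picture takes shape"),
    Article(date(2026, 1, 3), "Playoff upset reshuffles the odds"),
]
env = ReplayEnvironment(corpus, start=date(2026, 1, 1), horizon_days=90)

visible_on_day_one = env.search("playoff")   # only the Jan 1 article
env.advance_day()
env.advance_day()
visible_on_day_three = env.search("playoff")  # both articles
```

In this sketch the gating lives inside `search`, so an agent cannot leak future information regardless of what it queries.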


Yann LeCun@ylecun

@ziv_ravid Continuous, high-dimensional, noisy data. LLMs totally suck at those.

Ravid Shwartz Ziv@ziv_ravid

I must admit that 1-2 years ago, I was sure that LLMs would be much better at predicting the future. It makes sense that if you can search and aggregate information from different sources, you can predict events. But so far, all the models have failed quite badly. I'm not sure what the missing parts are here. It might be a good memory system, but it might be something more fundamental, such as a missing internal world model. Anyway, very interesting problem to (try) to solve


Shashwat Goel@ShashwatGoel7

Continual learning is bottlenecked by realistic evaluations

Introducing FutureSim, which replays real-world events in the temporal order they occurred

We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Future prediction benchmarks are trivially scalable, uncheatable (under some assumptions) and impossible to saturate. They should be getting more love.


Lisan al Gaib@scaling01

new forecasting benchmark: FutureSim

GPT-5.5 performs the best at 25%, but Mythos, Gemini 3.1 Pro and Opus 4.7 are not included. Based on their Brier Skill Score the models don't seem to be much better than just assigning equal probabilities to all outcomes
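For readers unfamiliar with the metric, here is a minimal sketch of a Brier skill score. The thread does not spell out FutureSim's exact definition, so two things here are assumptions: the multi-class Brier score (sum of squared errors over outcomes) and the uniform forecast used as the reference. The function names are hypothetical.

```python
def brier_score(probs, outcome_index):
    """Multi-class Brier score: squared error against the one-hot outcome."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def brier_skill_score(probs, outcome_index):
    """1 - BS / BS_ref, with a uniform forecast as the (assumed) reference.

    Positive: better than equal probabilities. Zero: no better.
    Negative: worse, which is typical of confident wrong forecasts.
    """
    k = len(probs)
    reference = brier_score([1.0 / k] * k, outcome_index)
    return 1.0 - brier_score(probs, outcome_index) / reference


# Four possible outcomes; outcome 0 is what actually happened.
good = brier_skill_score([0.7, 0.1, 0.1, 0.1], 0)            # positive
uniform = brier_skill_score([0.25, 0.25, 0.25, 0.25], 0)     # zero
overconfident = brier_skill_score([0.0, 1.0, 0.0, 0.0], 0)   # negative
```

The asymmetry is the point: a confidently wrong forecast is penalized far more than a hedged one, so a model can reach nontrivial accuracy and still score near or below zero.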


Lisan al Gaib@scaling01

My take on why LLMs still suck at forecasting is that in some cases they are still over-reliant on pre-training priors and that they are getting fried by post-training / RL and safety, which makes them non-committal and hedgy. But I also think it's just not a skill that they are picking up in any math or coding envs.

In the GPT-4 technical report there was a figure on the calibration of the base model vs. the final post-trained model. The base model was almost perfectly calibrated; the post-trained model wasn't.

(this is of course weak'ish evidence because we don't know whether this is still true with current models and how its calibration translates to forecasting performance)
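To make "calibration" concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify it. This is an illustration only: nothing in the thread says that GPT-4's report or FutureSim uses this exact measure, and the function name, bin count, and sample numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Made-up data: ten answers, each given with 90% confidence.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

A well-calibrated model that says 90% is right about 90% of the time, giving an ECE near zero; the second case claims 90% but is right only half the time.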



Nikhil Chandak@nikhilchandak29

Introducing FutureSim, the first interactive environment testing agents on predicting world events.

We build a simulation where agents face forecasting questions over the course of 3 months. News articles come in each day, and agents continuously revise their predictions in light of new information, as we show below for GPT-5.5. (1/5)


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

actually lots of interesting empirical results here, that go beyond "forecasting the future eval". Models differ a lot in how they respond to starting from the worst agent's outputs! V4 is generally bad here but recovers the most. Is it just an issue of in- vs out- distribution?


Maksym Andriushchenko@maksym_andr

💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.

We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.

The best performing agent, GPT 5.5 in Codex, consumes 3700 turns and 12.4M tokens spanning many sequential context window compactions in a single run.

(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)


Ameya P.@AmyPrb

Can agents continually adapt their beliefs with new information from real-world events?

We provide a testbed for LLM agents to learn to accumulate useful signals across time.

Exciting new directions👇:
• Memory
• Search
• Multi-agent self-play
• Inference Scaling


Arvindh Arun@arvindh__a

paper: https://alphaxiv.org/abs/2605.15188
blog: https://openforecaster.github.io/futuresim/

work done with @ShashwatGoel7 @nikhilchandak29 @AmyPrb Steffen Staab, Moritz Hardt @maksym_andr @jonasgeiping


Nikhil Chandak@nikhilchandak29

FutureSim is a long-horizon benchmark requiring agents to reason under uncertainty and adapt at test time. Agents can search, choose when and on which questions to update their predictions, write memory, and learn from resolutions.

For example, GPT-5.5 (xhigh) in Codex consumes over 3700 tool calls and 12M tokens spanning multiple context compactions in a single run. (2/5)


Shashwat Goel@ShashwatGoel7

How do agents perform? All models improve in accuracy over time. GPT 5.5 outperforms Opus 4.6 by large margins.

The skill score is more informative, showing most models are worse than no prediction (0). Qwen 3.6 keeps becoming more overconfident. DeepSeek adapts to beat GLM 5.1


Shashwat Goel@ShashwatGoel7

We're glad you liked it :) We also have lots of cool results in the thread beyond frontier models, like how multi agents hivemind, DeepSeek is ~best at test time adaptation, and the value of good search agents.

Opus 4.7 behaved worse than 4.6 in initial tests, and its training data covers January. We could soon bench it on later months, but hopefully Mythos is released by then anyway.

Likewise, for Gemini we'd need a LOT of API credits ($1000 per run). Currently we benchmark through coding plans, but Antigravity is unreliable.


Nikhil Chandak@nikhilchandak29

FutureSim enables research on multiple emerging directions like search, memory, reasoning under uncertainty, and multi-agent systems. We provide evidence for each of these in our blog below. (5/5)

Blog: https://openforecaster.github.io/futuresim/
Code: https://github.com/OpenForecaster/futuresim
Paper: https://arxiv.org/abs/2605.15188


Shashwat Goel@ShashwatGoel7

More details, code, and trajectories in:

Blog: https://openforecaster.github.io/futuresim/

Paper: https://www.alphaxiv.org/abs/2605.15188

It's been great fun building this with the boys, @nikhilchandak29 and @arvindh__a. Do check out their threads (and give them a follow!) for more perspectives.

And ofc, this was possible thanks to the support of our advisors @jonasgeiping @maksym_andr Moritz Hardt Steffen Staab and @AmyPrb


Pit Schultz@pitsch

@teortaxesTex "Most SOTA LLMs lose money in live/agentic trading due to overconfidence, slippage, and poor long-horizon adaptation." https://huggingface.co/blog/charles-azam/predibench


Shashwat Goel@ShashwatGoel7

In FutureSim, the context corpus evolves each day with real news articles. Agents can learn from this, and also from the ground truth as it becomes known.

Agents decide when and which predictions to update, over a longgg horizon: 3 months, multiple compactions, 1000s of actions per run


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

V3.2 can't make use of multi-agent here.
