Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎
FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔
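For intuition, here is a minimal sketch of what date-gated replay could look like. This is not the actual FutureSim code: `Article`, `DateGatedSim`, `search`, and `advance_day` are hypothetical names, and the substring match stands in for whatever search tool the real environment exposes. The one property it illustrates is that an agent can only see articles published on or before the current simulated day.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    published: date
    title: str
    text: str


@dataclass
class DateGatedSim:
    """Hypothetical replay environment (illustrative only)."""
    articles: list
    today: date
    end: date

    def search(self, query: str) -> list:
        # Date gate: articles dated after `today` are never returned.
        q = query.lower()
        return [
            a for a in self.articles
            if a.published <= self.today
            and q in (a.title + " " + a.text).lower()
        ]

    def advance_day(self) -> bool:
        # The agent chooses when to call this; returns False once the window ends.
        if self.today >= self.end:
            return False
        self.today += timedelta(days=1)
        return True


start = date(2026, 1, 1)
sim = DateGatedSim(
    articles=[
        Article(date(2026, 1, 1), "Playoff preview", "Who is favored this year"),
        Article(date(2026, 1, 3), "Playoff upset", "A surprise result"),
    ],
    today=start,
    end=start + timedelta(days=90),
)
visible_on_day_one = sim.search("playoff")  # only the Jan 1 article is visible here
```

Calling `sim.advance_day()` twice moves the clock to Jan 3, after which the same query also returns the second article.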
We evaluate frontier models in their native harnesses. GPT 5.5 (Codex) leads at 25% accuracy, followed by Opus 4.6 (Claude Code) at 20% 📈 Open-weight frontier models still have a significant gap to close, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%
On questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, for example in the Super Bowl LX ($704M traded) market 💰💸
FutureSim serves as a test bed for evaluating several important agentic capabilities:
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute
More cool insights and deep dives in our paper 👇