
FutureSim benchmark evaluates AI agents on continual learning

FutureSim introduces a benchmark that tests frontier AI agents on continual learning by feeding them real-world news updates in chronological order. It tracks how GPT-5.5 revises its forecasts, for example the Seattle Seahawks' Super Bowl win probability between January 19 and February 7, and the probability of Balen Shah becoming Nepal's prime minister, which climbed from near zero to 74 percent between February 25 and March 6, recording running probabilities, update counts, and Brier skill scores.

Original post

Continual learning is bottlenecked by realistic evaluations.

Introducing FutureSim, which replays real-world events in the temporal order they occurred.

We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex and Claude Code.

10:14 AM · May 15, 2026
Reposted by

new forecasting benchmark: FutureSim

GPT-5.5 performs best at 25%, but Mythos, Gemini 3.1 Pro, and Opus 4.7 are not included. Based on their Brier skill scores, the models don't seem to be much better than just assigning equal probabilities to all outcomes.
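For context on that comparison: the Brier skill score measures a forecaster's mean Brier score against a reference forecast, here a baseline that assigns equal probability to every outcome. A minimal sketch of that computation, assuming the uniform baseline described in the post (function and variable names are illustrative, not from the FutureSim paper):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))

def brier_skill_score(probs, outcomes, n_options=2):
    """BSS = 1 - BS_model / BS_baseline, with a uniform-probability baseline."""
    baseline = np.full(len(outcomes), 1.0 / n_options)
    return 1.0 - brier_score(probs, outcomes) / brier_score(baseline, outcomes)

# Example: a forecaster modestly better than chance on three binary questions.
model_probs = [0.7, 0.4, 0.8]
resolved = [1, 0, 1]
print(brier_skill_score(model_probs, resolved))  # ~0.61; > 0 beats uniform
```

A BSS above zero means the model's probabilities beat the uniform guess; the post's point is that the measured scores sit close to zero.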

Arvindh Arun @arvindh__a

Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎

FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔

We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈 Open-weight frontier models have a significant gap to catch up, with DeepSeek V4 Pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%

On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market 💰💸

FutureSim serves as a test bed for evaluating a lot of important agentic capabilities:
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute

More cool insights and deep dives in our paper 👇

5:15 PM · May 15, 2026 · 16.4K Views
2:09 PM · May 16, 2026 · 4.2K Views
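The thread above sketches the environment mechanics: articles are date-gated, agents choose what to read and which questions to update, and agents control when the simulated clock advances. A minimal sketch of such a replay loop, assuming hypothetical `agent` and `articles_by_day` interfaces (none of these names come from the FutureSim codebase):

```python
from datetime import date, timedelta

def replay(agent, articles_by_day, questions, start=date(2026, 1, 1), days=90):
    """Replay a simulated window day by day with date-gated article access.

    articles_by_day: dict[date, list[str]] -- an article becomes visible
    only once the simulation clock reaches its publication date.
    """
    forecasts = {}  # question id -> list of (date, probability) updates
    current, end = start, start + timedelta(days=days)
    while current < end:
        visible = articles_by_day.get(current, [])
        # The agent decides which questions to update and what to search for.
        for qid, prob in agent.step(current, visible, questions):
            forecasts.setdefault(qid, []).append((current, prob))
        # The agent, not the harness, decides when to advance the clock.
        if agent.done_with_day(current):
            current += timedelta(days=1)
    return forecasts
```

The design point echoed in the thread is that day advancement is an agent decision, so the agent trades off reading more articles against moving time forward.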

💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.

We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.

The best-performing agent, GPT 5.5 in Codex, consumes 3,700 turns and 12.4M tokens in a single run, spanning many sequential context-window compactions.
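"Sequential context window compactions" presumably means the harness repeatedly summarizes older turns to stay under the model's context limit across thousands of tool calls. A rough illustration of that idea, with entirely hypothetical thresholds and a stub summarizer (the real Codex mechanism is not documented here):

```python
def summarize(turns):
    """Stub digest; a real harness would have the model summarize old turns."""
    return f"[summary of {len(turns)} earlier turns]"

def maybe_compact(history, token_count, limit=200_000, keep_recent=50):
    """Replace older turns with a summary once the context nears its limit.

    `limit` and `keep_recent` are illustrative values, not Codex's real ones.
    """
    if token_count < int(0.9 * limit):
        return history  # still room; no compaction needed
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent
```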

(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)

Shashwat Goel @ShashwatGoel7 · 5:14 PM · May 15, 2026 · 32.9K Views
5:59 PM · May 15, 2026 · 2.9K Views

What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt?

One domain we really like here is forecasting, a hard task that tests reasoning under uncertainty.

We've made a benchmark out of this, where we simulate a whole 3-month period of news and, in a sandbox, let models continuously read news from those days, plan, and update their forecasts. (See the animation below; just don't be fooled by its speed, this is a slice of the larger 12M-token trajectory.)

Many more details linked below:

Shashwat Goel @ShashwatGoel7 · 5:14 PM · May 15, 2026 · 32.9K Views
5:49 PM · May 15, 2026 · 3.1K Views
