FutureSim benchmark evaluates AI agents on continual learning
FutureSim introduces a benchmark that tests frontier AI agents on continual learning by feeding them sequential real-world news updates in chronological order. For example, it tracks GPT-5.5's forecast updates for the Seattle Seahawks' Super Bowl win probability from January 19 to February 7, and for Balen Shah becoming Nepal's prime minister (from near zero to 74 percent between February 25 and March 6), recording running probabilities, update counts, and Brier skill scores.
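The summary mentions Brier skill scores. As a refresher, here is a minimal sketch using the standard definitions (this is illustrative, not FutureSim's actual scoring code; the reference forecast is whatever baseline the benchmark chooses, e.g. a constant probability):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities in [0, 1]
    and binary outcomes (0 or 1). Lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes, ref_probs):
    """1 - BS/BS_ref: positive values mean the forecasts beat the
    reference baseline; 1.0 is a perfect forecaster."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score(ref_probs, outcomes)
    return 1.0 - bs / bs_ref
```

So an agent that assigns 0.9 to an event that occurs, against a 0.5 baseline, gets a skill score of 0.96.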
💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.
We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.
The best-performing agent, GPT-5.5 in Codex, consumes 3,700 turns and 12.4M tokens spanning many sequential context-window compactions in a single run.
(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)
Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex and Claude Code.
What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt?
One domain we really like here is forecasting: a hard task that tests reasoning under uncertainty.
We've made a benchmark out of this, where we simulate a whole 3-month period of news and let sandboxed models continuously read news from those days, plan, and update their forecasts. (See the animation below; just don't be fooled by its speed, this is a slice of the larger 12.4M-token trajectory.)
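The replay setup described above, where the agent only ever sees news up to the current simulated day, can be sketched roughly as follows (names like `news_by_day` and `agent.update_forecasts` are hypothetical, not FutureSim's actual API):

```python
import datetime

def replay(agent, news_by_day, start, end):
    """Step through the simulated period one day at a time, showing the
    agent only that day's articles and recording its updated forecasts."""
    forecasts = {}
    day = start
    while day <= end:
        articles = news_by_day.get(day, [])
        # The agent plans and revises its running probabilities
        # given only news up to `day` (no lookahead).
        forecasts[day] = agent.update_forecasts(day, articles)
        day += datetime.timedelta(days=1)
    return forecasts
```

The key design point is the strict chronological feed: because events are replayed in the order they occurred, the environment is reproducible while still testing adaptation to genuinely new information.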
Many more details linked below: