
FutureSim benchmark evaluates AI agents on continual learning

FutureSim introduces a benchmark that tests frontier AI agents on continual learning by feeding them real-world news updates in chronological order. It tracks how GPT-5.5 revises its forecasts, for example the Seattle Seahawks' Super Bowl win probability between January 19 and February 7, and the probability of Balen Shah becoming Nepal's prime minister, which climbed from near zero to 74 percent between February 25 and March 6, recording running probabilities, update counts, and Brier skill scores.

Original post

Continual learning is bottlenecked by realistic evaluations.

Introducing FutureSim, which replays real-world events in the temporal order they occurred.

We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex and Claude Code.

10:14 AM · May 15, 2026
Reposted by

new forecasting benchmark: FutureSim

GPT-5.5 performs best at 25%, but Mythos, Gemini 3.1 Pro, and Opus 4.7 are not included. Based on their Brier skill scores, the models don't seem to be much better than just assigning equal probabilities to all outcomes.
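For context on that comparison: the Brier skill score measures a forecaster's mean Brier score against a reference forecast, here a baseline that assigns equal probability to every outcome. A minimal sketch of that computation, assuming the uniform baseline described in the post (function and variable names are illustrative, not from the FutureSim paper):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))

def brier_skill_score(probs, outcomes, n_options=2):
    """BSS = 1 - BS_model / BS_baseline, with a uniform-probability baseline."""
    baseline = np.full(len(outcomes), 1.0 / n_options)
    return 1.0 - brier_score(probs, outcomes) / brier_score(baseline, outcomes)

# Example: a forecaster modestly better than chance on three binary questions.
model_probs = [0.7, 0.4, 0.8]
resolved = [1, 0, 1]
print(brier_skill_score(model_probs, resolved))  # ~0.61; > 0 beats uniform
```

A BSS above zero means the model's probabilities beat the uniform guess; the post's point is that the measured scores sit close to zero.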

Arvindh Arun @arvindh__a

Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎

FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔

We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈 Open-weight frontier models have a significant gap to catch up, with DeepSeek V4 Pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%

On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market 💰💸

FutureSim serves as a test bed for evaluating a lot of important agentic capabilities:
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute

More cool insights and deep dives in our paper 👇

5:15 PM · May 15, 2026 · 16.4K Views
2:09 PM · May 16, 2026 · 4.2K Views
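The thread above sketches the environment mechanics: articles are date-gated, agents choose what to read and which questions to update, and agents control when the simulated clock advances. A minimal sketch of such a replay loop, assuming hypothetical `agent` and `articles_by_day` interfaces (none of these names come from the FutureSim codebase):

```python
from datetime import date, timedelta

def replay(agent, articles_by_day, questions, start=date(2026, 1, 1), days=90):
    """Replay a simulated window day by day with date-gated article access.

    articles_by_day: dict[date, list[str]] -- an article becomes visible
    only once the simulation clock reaches its publication date.
    """
    forecasts = {}  # question id -> list of (date, probability) updates
    current, end = start, start + timedelta(days=days)
    while current < end:
        visible = articles_by_day.get(current, [])
        # The agent decides which questions to update and what to search for.
        for qid, prob in agent.step(current, visible, questions):
            forecasts.setdefault(qid, []).append((current, prob))
        # The agent, not the harness, decides when to advance the clock.
        if agent.done_with_day(current):
            current += timedelta(days=1)
    return forecasts
```

The design point echoed in the thread is that day advancement is an agent decision, so the agent trades off reading more articles against moving time forward.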

💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.

We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.

The best-performing agent, GPT 5.5 in Codex, consumes 3,700 turns and 12.4M tokens in a single run, spanning many sequential context-window compactions.
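"Sequential context window compactions" presumably means the harness repeatedly summarizes older turns to stay under the model's context limit across thousands of tool calls. A rough illustration of that idea, with entirely hypothetical thresholds and a stub summarizer (the real Codex mechanism is not documented here):

```python
def summarize(turns):
    """Stub digest; a real harness would have the model summarize old turns."""
    return f"[summary of {len(turns)} earlier turns]"

def maybe_compact(history, token_count, limit=200_000, keep_recent=50):
    """Replace older turns with a summary once the context nears its limit.

    `limit` and `keep_recent` are illustrative values, not Codex's real ones.
    """
    if token_count < int(0.9 * limit):
        return history  # still room; no compaction needed
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent
```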

(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)

Shashwat Goel @ShashwatGoel7 · 5:14 PM · May 15, 2026 · 32.9K Views
5:59 PM · May 15, 2026 · 2.9K Views

What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt?

One domain we really like here is forecasting, a hard task that tests reasoning under uncertainty.

We've made a benchmark out of this, where we simulate a whole 3-month period of news and, in a sandbox, let models continuously read news from those days, plan, and update their forecasts. (See the animation below; just don't be fooled by its speed, this is a slice of the larger 12M-token trajectory.)

Many more details linked below:

Shashwat Goel @ShashwatGoel7 · 5:14 PM · May 15, 2026 · 32.9K Views
5:49 PM · May 15, 2026 · 3.1K Views
