AI · 2h ago

Nikhil Chandak finds Claude Fable 5 and GPT-5.5 fail to dynamically update forecasts, maintaining flat 20% FutureSim accuracy

The evaluation used post-cutoff data to prevent knowledge contamination.

Original post

Maksym Andriushchenko #1063

Nikhil Chandak@nikhilchandak29

🚨 Claude Fable 5 on FutureSim 🚨

While Anthropic folks have been using it internally for predicting evals, we actually put it to test on FutureSim.

We found it has very strong priors but fails to update predictions over time and ends up no better than GPT-5.5!

We report our results on Feb-March subset to minimize contamination with Fable's knowledge cutoff of Jan.

Sholto Douglas@_sholtodouglas

we don’t even run evals anymore we just ask Claude what the score will be

4:51 AM · Jun 10, 2026 · 5.6K Views

Posts from X

Maksym Andriushchenko@maksym_andr

it's impressive how quickly @nikhilchandak29 added the Fable 5 results to our FutureSim benchmark!

Fable 5 is at 20% accuracy - and y'all are saying we don't have unsaturated evals? :-)
