🚨 Claude Fable 5 on FutureSim 🚨
While Anthropic folks have been using it internally to predict evals, we actually put it to the test on FutureSim.
We found it has very strong priors but fails to update predictions over time and ends up no better than GPT-5.5!
We report our results on the Feb-March subset to minimize contamination, given Fable's knowledge cutoff of Jan.
we don’t even run evals anymore we just ask Claude what the score will be