
Fable Benchmark Demonstrates Strong Calibration for AI Self-Assessment

Original post
rohit @krishnanrohit · #1214 in AI

🚨 Fable benchmark. Tried to update MarketBench too, with Fable cc @AndreyFradkin. Fable is very good (much better calibrated) on judging its own capabilities - mean stated confidence 0.85 against a realized 87% pass rate, Brier 0.117, and its rare low-confidence calls landed on genuine traps.

There's probably some leakage here though of the questions, the model seemed to *know* what to answer reading its writing, but the level of contagion is hard to parse.

3:19 AM · Jun 10, 2026 · 2.3K Views
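
For readers unfamiliar with the figures quoted in the post, here is a minimal sketch of how a mean stated confidence, a realized pass rate, and a Brier score are typically computed from per-task confidences and pass/fail outcomes. The function name `calibration_summary` and the sample data are hypothetical, invented for illustration; this is not the benchmark's actual scoring code, and the numbers are not taken from the Fable run.

```python
from statistics import mean


def calibration_summary(confidences, outcomes):
    """Summarize calibration for binary pass/fail tasks.

    confidences: stated probabilities of passing, each in [0, 1]
    outcomes:    realized results, 1 for pass and 0 for fail
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")

    mean_confidence = mean(confidences)
    pass_rate = mean(outcomes)
    # Brier score: mean squared error between stated probability and outcome.
    # Lower is better; 0 is a perfect forecaster.
    brier = mean([(c - o) ** 2 for c, o in zip(confidences, outcomes)])

    return {
        "mean_confidence": mean_confidence,
        "pass_rate": pass_rate,
        # Positive gap means overconfident, negative means underconfident.
        "gap": mean_confidence - pass_rate,
        "brier": brier,
    }


# Made-up data: three confident calls that passed, one low-confidence call that failed.
summary = calibration_summary([0.9, 0.9, 0.9, 0.3], [1, 1, 1, 0])
print(summary)
```

A small gap between mean confidence and pass rate only shows that the model is calibrated on average. The Brier score also rewards putting low confidence on the specific tasks that fail, which is the property the post's remark about low-confidence calls landing on genuine traps refers to.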
rohit @krishnanrohit

It bid a flat 0.88 - 0.93 on everything because it said stuff about remembered gold patches. I didn't push it more because contamination makes it hard!
