🚨 Fable benchmark. Tried updating MarketBench with Fable too, cc @AndreyFradkin. Fable is very good (much better calibrated) at judging its own capabilities: mean stated confidence 0.85 against a realized 87% pass rate, Brier 0.117, and its rare low-confidence calls landed on genuine traps.
There's probably some leakage of the questions here, though: judging from its writing, the model seemed to *know* what to answer, but the extent of the contamination is hard to gauge.
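
For anyone wanting to sanity-check calibration numbers like the ones above, here's a minimal sketch of how mean confidence, pass rate, and Brier score are usually computed. The function name `calibration_summary` and the toy data are made up for illustration; they are not MarketBench's actual code or Fable's actual results.

```python
from statistics import mean


def calibration_summary(confidences, outcomes):
    """Summarize self-assessment calibration.

    confidences: stated probabilities of success, each in [0, 1]
    outcomes:    realized results, 1 for pass and 0 for fail
    (Hypothetical helper, not part of any benchmark's real API.)
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need two non-empty lists of equal length")

    # Brier score: mean squared gap between stated confidence and outcome.
    # 0 is perfect; always answering 0.5 scores 0.25.
    brier = mean((c - o) ** 2 for c, o in zip(confidences, outcomes))

    return {
        "mean_confidence": mean(confidences),
        "pass_rate": mean(outcomes),
        "brier": brier,
    }


# Toy data only: three confident passes and one low-confidence fail.
toy_confidences = [0.9, 0.9, 0.8, 0.3]
toy_outcomes = [1, 1, 1, 0]
summary = calibration_summary(toy_confidences, toy_outcomes)
print(summary)
```

A model is well calibrated when mean confidence sits close to the pass rate and the Brier score is low; the low-confidence fail in the toy data is the kind of call that keeps the score down.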