My impressions of GPT-5.6, having asked around:
- The 5.5 base (that 5.6 inherits) is fundamentally weaker than the larger Mythos/Fable base
- With some good RL, 5.6 can beat Fable, but only with everything maxed out (Sol Ultra, so multiple Sol agents on max effort)
- OpenAI were very selective with the benchmarks they published for a reason - I doubt the results from other notable benchmarks, once this is released, will show as significant a jump from 5.5
- 5.6 is a heinous reward hacker, and while all models do "cheat" on benchmarks, GPT is the most aggressive (see yesterday's METR results). This, combined with some other conversations, makes me think Fable will still feel like a better model in real-world use
- The price is perhaps the most attractive thing about 5.6 - 5/30 is significantly better than Fable's 10/50 - but Fable can do more with fewer tokens in most cases
- Terra and Luna look great for price-performance, but could feel a lot worse in actual use than their TBench 2.1 results suggest
- Personally, my go-to is unlikely to change. Fable is a beast and a great model to use, and once it's back I won't hesitate to use it as my default again. But 5.6 will be great for checking Fable's work and for the very rare instance where Fable gets stuck
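
To put rough numbers on the pricing point above, here is a minimal sketch of the break-even. It assumes "5/30" and "10/50" mean USD per million input/output tokens, and the token counts are made up purely for illustration, not measured from either model.

```python
# Assumption: prices are USD per million (input, output) tokens.
PRICES = {"GPT-5.6": (5.0, 30.0), "Fable": (10.0, 50.0)}


def task_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost in USD of one task for the given token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical task size (illustrative numbers only).
INPUT_TOKENS = 200_000
OUTPUT_TOKENS = 40_000

gpt_cost = task_cost("GPT-5.6", INPUT_TOKENS, OUTPUT_TOKENS)
fable_same_tokens = task_cost("Fable", INPUT_TOKENS, OUTPUT_TOKENS)

# Fraction of GPT-5.6's token usage at which Fable costs the same,
# assuming Fable keeps the same input/output mix.
break_even = gpt_cost / fable_same_tokens

print(f"GPT-5.6: ${gpt_cost:.2f}")
print(f"Fable at the same token count: ${fable_same_tokens:.2f}")
print(f"Fable breaks even at {break_even:.0%} of the tokens")
```

Under these assumptions, Fable only comes out cheaper if it finishes the job with a bit over half the tokens. The exact threshold depends on the input/output mix: 50% for a task that is all input, 60% for one that is all output.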