Added to prinzbench: Opus 4.8.
For the very first time, the Max setting was available to me in the Claude app when I used this model. With this setting, Claude's performance improved dramatically over all prior Anthropic models: Opus 4.8 (Max) scored 42/99 on prinzbench, compared to 25/99 for Opus 4.7 (Extended).
This is the second-highest score to date among tested models that (i) were not released by OpenAI and (ii) do not use a multi-agent setup or parallelized compute. (Gemini 3.1 Pro is still the best such model, having scored 50/99.)
I am now very curious about how the "Mythos-class models" that Anthropic has promised to release in the near future will perform on my benchmark.