For me, the ultimate intelligence benchmark is still writing text. Which is kinda ironic, because models that were created to generate text in the first place are still often so bad at it.
If you write professionally, you quickly notice that it’s usually easier to write the text yourself than to edit whatever the model generated. Today, only the largest models can be genuinely helpful with writing. Right now that means GPT-5.5 Pro (which, judging by its behavior, feels to me like a successor to the sunsetted 4.1 large model) and, as of today, Fable too.
My benchmark is pretty simple: I take large portions of fiction books by no-name authors, including me 😅, and ask the model to continue them. Only the larger models are really able to capture the voice, the barely noticeable nuances of vocabulary, and the author’s biases. Smaller models quickly fall back into default, flavorless narration, with a few bits of pretentiousness sprinkled in.