I'd have said that performance on verifiable tasks has been improving impressively (via RLVR), with some of that improvement transferring to non-verifiable tasks, but notably less. And I'd have guessed that non-verifiable tasks are following a different curve, one with a lower plateau.
Anthropic's recent post 'When AI Builds Itself' included the following claim, which I thought was crucial yet unsupported. Are all measurable capabilities really improving on the same curve? What is the best evidence for this?