Congrats to Anthropic for great progress in sandbagging! The competitors can't distill your capabilities if you don't ship them! That's the winner's attitude. In the end, there's not much difference between honestly serving tokens and renting out your GPUs…
Claude Sonnet 5 (high) scores 68.8% on WeirdML, comparable to GLM-5.2, and up from Sonnet 4.6 at 66.1%.
It seems different from Sonnet 4.6, and it does the Opus thing of sometimes just exploring the data instead of trying to solve the task.





