Håvard Ihle's agentic WeirdML benchmark tests push Claude Opus 4.7 and GPT-5.5 to 90% accuracy · Digg