Opus 4.8 underperforms Opus 4.7 on the LLM Debate Benchmark (1717 → 1697, down 20 points), but Claude still dominates the leaderboard.
Qwen 3.7 Max scores below Qwen 3.6 Max: 1540 → 1499. Step 3.7 Flash lands at 1457. Ernie 5.1 gains 136 points over Ernie 5.0: 1311 → 1447.

