
Claude Opus 4.8 Drops On LLM Debate Benchmark As Ernie 5.1 Gains

Original post · Lisan al Gaib #975
Lech Mazur @LechMazur

Opus 4.8 underperforms Opus 4.7 on the LLM Debate Benchmark (1717 → 1697), but Claude is still dominating the leaderboard.

Qwen 3.7 Max scores worse than Qwen 3.6 Max: 1540 → 1499. Step 3.7 Flash lands at 1457. Ernie 5.1 improves a lot over Ernie 5.0: 1311 → 1447.

10:46 AM · Jun 5, 2026 · 6.3K Views
Lech Mazur @LechMazur

This benchmark measures how well LLMs perform in multi-turn debates across a wide range of topics. Each matchup runs twice on the same topic with sides swapped. A 3-model judge panel decides winner and margin.
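The protocol described above (two runs per topic with sides swapped, each decided by a three-judge panel) could be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's code: the function names, the majority-vote rule, and the way margins are averaged are all assumptions, and the vote data is made up.

```python
from collections import Counter
from statistics import mean


def panel_verdict(votes):
    """Combine one panel's votes into a single verdict.

    votes: list of (winner, margin) pairs, one per judge (hypothetical format).
    Returns (winner, mean margin among judges who picked that winner),
    or (None, 0.0) when no side has a strict majority.
    Majority voting is an assumption, not the benchmark's documented rule.
    """
    tally = Counter(winner for winner, _ in votes)
    winner, count = tally.most_common(1)[0]
    if count <= len(votes) / 2:
        return None, 0.0
    return winner, mean(m for w, m in votes if w == winner)


def matchup(first_run_votes, swapped_run_votes):
    """Count debate wins per model over the two side-swapped runs."""
    wins = Counter()
    for votes in (first_run_votes, swapped_run_votes):
        winner, _ = panel_verdict(votes)
        if winner is not None:
            wins[winner] += 1
    return wins


# Made-up votes for two hypothetical models "A" and "B" on one topic.
run1 = [("A", 2.0), ("A", 1.0), ("B", 1.0)]  # A argues the first side
run2 = [("B", 1.0), ("A", 3.0), ("A", 2.0)]  # sides swapped
print(matchup(run1, run2))
```

Swapping sides means a model cannot win a matchup just because one side of the topic is easier to argue.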

The heatmap shows how models perform against each other:

23h · Views 64 · Likes 8 · Bookmarks 1
Lech Mazur @LechMazur

More info, including transcripts: https://github.com/lechmazur/debate

23h · Views 56 · Likes 4 · Bookmarks 1
Lech Mazur @LechMazur

Decisive Cross-Judge Agreement:

23h · Views 21 · Likes 1
Lech Mazur @LechMazur

Scale is relative within this pool and centered near 1500. This is not an absolute capability score. Bradley-Terry is Elo-like.
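To make "Bradley-Terry is Elo-like" concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise win counts and mapping them onto a scale centered at 1500. It is not the benchmark's implementation: the win counts and model names are made up, the fit uses the standard minorization-maximization update, and the 400-points-per-factor-of-10 scaling is borrowed from conventional Elo as an assumption.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] = number of times model i beat model j.
    Assumes every model has at least one win and one game played;
    otherwise the update divides by zero or takes log(0).
    """
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].get(j, 0) for j in models if j != i)
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        # Rescale so the geometric mean strength is 1 (fixes the free scale).
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_elo_like(strength, center=1500.0, scale=400.0):
    """Map strengths to ratings; center and scale are assumed conventions."""
    return {m: center + scale * math.log10(s) for m, s in strength.items()}


def win_prob(rating_a, rating_b, scale=400.0):
    """Expected probability that A beats B under the assumed scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Made-up win counts for three hypothetical models.
wins = {
    "X": {"Y": 7, "Z": 8},
    "Y": {"X": 3, "Z": 6},
    "Z": {"X": 2, "Y": 4},
}
ratings = to_elo_like(fit_bradley_terry(wins))
print(ratings)
```

Because strengths are normalized to a geometric mean of 1, the ratings average exactly 1500 within the pool. Adding or removing a model shifts every rating, which is why the score is relative rather than absolute. If the leaderboard used this same 400-point scale (the posts do not say), the 20-point gap between 1717 and 1697 would correspond to an expected win rate of roughly 53% for the higher-rated model.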

Price vs. Performance:

23h · Views 38 · Likes 5 · Bookmarks 1
PeterJot @PeterJot

@LechMazur Interesting outcome... 🤔

22h · Views 8