Claude Opus 4.8 Drops On LLM Debate Benchmark As Ernie 5.1 Gains

Original post

Lech Mazur@LechMazur

Opus 4.8 underperforms Opus 4.7 on the LLM Debate Benchmark (1717 → 1697), but Claude is still dominating the leaderboard.

Qwen 3.7 Max scores worse than Qwen 3.6 Max: 1540 → 1499. Step 3.7 Flash lands at 1457. Ernie 5.1 improves a lot over Ernie 5.0: 1311 → 1447.

10:46 AM · Jun 5, 2026 · 6.3K Views

Posts from X

Lech Mazur@LechMazur

This benchmark measures how well LLMs perform in multi-turn debates across a wide range of topics. Each matchup runs twice on the same topic with sides swapped. A 3-model judge panel decides winner and margin.
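The protocol described above (two games per topic with sides swapped, each scored by a 3-model panel) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the signed-margin convention, and the majority-vote and mean-margin aggregation are all assumptions, since the post does not say how the panel's votes are combined.

```python
from statistics import mean

def panel_verdict(signed_margins):
    """Aggregate one game's judge scores (hypothetical scheme).

    Each judge returns a signed margin: positive favors model A,
    negative favors model B. Winner is decided by majority of signs.
    """
    votes_a = sum(1 for m in signed_margins if m > 0)
    votes_b = sum(1 for m in signed_margins if m < 0)
    if votes_a > votes_b:
        winner = "A"
    elif votes_b > votes_a:
        winner = "B"
    else:
        winner = "tie"
    return winner, mean(signed_margins)

def matchup_result(game_one, game_two):
    """Combine the two side-swapped games played on the same topic."""
    verdicts = [panel_verdict(game_one), panel_verdict(game_two)]
    return {
        "wins_a": sum(1 for w, _ in verdicts if w == "A"),
        "wins_b": sum(1 for w, _ in verdicts if w == "B"),
        "mean_margin": mean(m for _, m in verdicts),
    }

# Model A wins the first game 2-1, model B wins the swapped game 2-1.
result = matchup_result([2, 1, -1], [-1, -2, 1])
```

Running each topic twice with sides swapped means a model that wins only because it drew the easier side of the motion ends up with a split result, as in the example.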

The heatmap shows how models perform against each other:

Lech Mazur@LechMazur

More info, including transcripts: https://github.com/lechmazur/debate

Lech Mazur@LechMazur

Decisive Cross-Judge Agreement:

Lech Mazur@LechMazur

Scale is relative within this pool and centered near 1500. This is not an absolute capability score. Bradley-Terry is Elo-like.
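To make "Elo-like" concrete, here is a small sketch of how a rating gap translates into an expected win rate under the conventional Elo logistic curve. The 400-point scale and the `win_probability` helper are assumptions for illustration; the post does not state which scale the benchmark uses, and because judges also report a margin, the actual fit may differ.

```python
def win_probability(rating_a, rating_b, scale=400.0):
    """Expected probability that A beats B on an Elo-style scale.

    Assumes the standard base-10 logistic with a 400-point scale,
    which the benchmark may or may not use.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Ratings quoted in the original post.
opus_gap = win_probability(1717, 1697)   # Opus 4.7 vs Opus 4.8
ernie_gap = win_probability(1447, 1311)  # Ernie 5.1 vs Ernie 5.0
```

Under these assumptions, the 20-point gap between Opus 4.7 and Opus 4.8 corresponds to roughly a 53% expected win rate, while the 136-point Ernie improvement corresponds to about 69%. Only rating differences matter in this formula, which is consistent with the scale being relative within the pool rather than an absolute score.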

Price vs. Performance:

Lech Mazur@LechMazur

Topics:

PeterJot@PeterJot

@LechMazur Interesting outcome... 🤔
