1d ago

Claude Opus 4.8 Max takes first on AutomationBench with 15.5%, but critics dispute the model hierarchy

Gemini 3.5 Flash (Low) unexpectedly outscored GPT-5.5 (High).

Sentiment: 0% positive, 100% negative

Many users dismissed Claude Opus 4.8 topping the AutomationBench leaderboard as unreliable or mistaken, citing insufficient error margins and possible benchmark flaws.

4 comments with sentiment.
