
TERMS-Bench Ranks Claude Opus 4.6 First in LLM Economic Negotiations


We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier. 🏆Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2. ✨Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B — best open-weight, holds up as negotiations get harder. 🔗 http://terms-bench.github.io

9:30 PM · May 16, 2026