TERMS-Bench Ranks Claude Opus 4.6 First in LLM Economic Negotiations

Original post

We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier. 🏆 Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2. ✨ Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B, the best open-weight model, holds up as negotiations get harder. 🔗 http://terms-bench.github.io

9:30 PM · May 16, 2026

Reposted by

@JAMES_Y_ZOU