Sakana AI Labs Unveils CoffeeBench for LLM Agent Business Simulations

Original post

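Sakana AI, together with KPMG AZSA LLC, has released CoffeeBench, a new benchmark for evaluating the long-horizon business management capabilities of LLM agents.

Blog: https://sakana.ai/coffee-bench/

In the real economy, businesses that trade with one another on an ongoing basis matter just as much as those that sell directly to consumers. CoffeeBench simulates a coffee-industry supply chain of six firms in total, spanning farms, roasters, and retailers, each run by an LLM agent. Over 90 days, the agents negotiate prices, place orders, manage inventory, and more, with the goal of maximizing net income.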
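To make the setup concrete, here is a toy sketch of a 90-day, six-firm supply-chain loop. It is not CoffeeBench's actual environment or API: the `Firm` class, the `trade` and `simulate` functions, the 2/2/2 split of firms, and every price, quantity, and starting balance are hypothetical, and fixed prices stand in for the negotiation that LLM agents would carry out.

```python
from dataclasses import dataclass

DAYS = 90  # horizon stated in the post


@dataclass
class Firm:
    name: str
    role: str                 # "farmer", "roaster", or "retailer"
    cash: float = 1000.0      # hypothetical starting cash
    inventory: int = 0
    revenue: float = 0.0
    costs: float = 0.0

    @property
    def net_income(self) -> float:
        # Simplified: purchases are expensed immediately (no COGS or invoices).
        return self.revenue - self.costs


def trade(seller: Firm, buyer: Firm, qty: int, price: float) -> int:
    """Move up to `qty` units from seller to buyer at a positive `price`."""
    affordable = int(buyer.cash // price)
    qty = max(0, min(qty, seller.inventory, affordable))
    amount = qty * price
    seller.inventory -= qty
    seller.cash += amount
    seller.revenue += amount
    buyer.inventory += qty
    buyer.cash -= amount
    buyer.costs += amount
    return qty


def simulate(days: int = DAYS) -> list[Firm]:
    # Assumed 2/2/2 split; the post only says six firms in total.
    farmers = [Firm(f"farm{i}", "farmer") for i in range(2)]
    roasters = [Firm(f"roaster{i}", "roaster") for i in range(2)]
    retailers = [Firm(f"retail{i}", "retailer") for i in range(2)]
    for _ in range(days):
        for farm, roaster, shop in zip(farmers, roasters, retailers):
            farm.inventory += 10              # hypothetical daily harvest
            farm.cash -= 20.0                 # hypothetical growing cost
            farm.costs += 20.0
            trade(farm, roaster, 10, 5.0)     # fixed prices replace negotiation
            trade(roaster, shop, 10, 8.0)
            sold = min(shop.inventory, 8)     # hypothetical consumer demand
            shop.inventory -= sold
            shop.cash += sold * 12.0
            shop.revenue += sold * 12.0
    return farmers + roasters + retailers


if __name__ == "__main__":
    for firm in simulate():
        print(f"{firm.name:9s} net income: {firm.net_income:8.1f}")
```

In a benchmark of this kind, the hard-coded quantities and prices inside the loop are where each agent's decisions would go, and net income at the end of the horizon is the score being compared.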

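When the latest LLMs competed in the same environment, their business results diverged widely. Some models negotiated aggressively and kept making moves that translated directly into profit, while others analyzed their own situation but never acted on it, kept waiting, and slid into the red. These are differences in behavior characteristic of long-horizon tasks.

CoffeeBench is a first step toward evaluating and analyzing the capabilities and behavior of LLM agents that interact with one another over long periods. Going forward, we aim to extend it into research on the cooperation, competition, and deviant behavior that arise among multiple agents, and on methods for auditing and governing them.

This work will be presented at the ICML 2026 Workshop "Failure Modes in Agentic AI".

Paper: https://arxiv.org/abs/2606.16613 ☕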

11:06 PM · Jun 25, 2026 · 162.2K Views

Takashi Ishida // ICML 2026 (@tksii)

Excited to share CoffeeBench!!☕️☕️☕️

We evaluate LLM agents in a 90-day B2B coffee supply-chain economy spanning farmers, roasters, and retailers, where these firms negotiate, manage inventory, set prices, handle invoices, and manage cash flow.

Beyond evaluating long-horizon business performance, such as whether agents can improve net income, I'm also excited about the accounting and AI safety angle: because CoffeeBench includes B2B trade, invoices, and cash-flow constraints, it could potentially help us study whether stronger future agents develop or discover problematic business behaviors such as circular trading, channel stuffing, or accounting-fraud-like strategies.

This was an exciting cross-disciplinary collaboration with researchers at KPMG AZSA @KPMG_JP and @SakanaAILabs colleagues @strayer_13 (first author!) and @taromakino 🤝

The work will be presented at @FAGENWorkshop in ICML 2026!🇰🇷

Sakana AI (@SakanaAILabs)

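In CoffeeBench, six agents interact through email and trades, and each firm aims to maximize its profit. When a society arrives in which LLM agents run businesses, how will cooperation, competition, and sometimes fraud emerge? CoffeeBench is also a testbed for observing exactly that.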
