
Artificial Analysis launches AA-Briefcase to benchmark long-horizon agentic AI work, with Claude Fable 5 leading

The benchmark simulates multi-week projects using thousands of documents.


Original post

Ethan Mollick@emollick

I have given AA a hard time about its previous agentic evaluation, but this looks like a good and impressive benchmark for real-world knowledge work that is unsaturated and has private hold-out tests.

This is one to watch - I didn’t see a human comparison score though?

Artificial Analysis@ArtificialAnlys

Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work

AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files.

We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently released GLM-5.2 (max, 1266) from @Zai_org.

Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40.

AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only.

Key elements of AA-Briefcase:

➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups

➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradictions, testing whether models can navigate the ambiguity of real-world knowledge work

➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor

➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work

Key results:

➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: Claude Opus 4.8 (max) follows at 1356, and the next-best non-Anthropic model, GLM-5.2 (max), sits ~90 points further back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase

➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost

➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria

➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models
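To give a feel for what the Elo gaps in the results above imply, here is a minimal sketch that converts a rating difference into an expected head-to-head score. It assumes the standard logistic Elo formula with a 400-point scale; the announcement does not say which rating model or scale Artificial Analysis uses, so treat the function name `elo_expected_score` and the `scale` parameter as illustrative assumptions rather than the benchmark's method.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: a 400-point scale, as in chess Elo. AA-Briefcase may use
    a different rating model for its pairwise grading.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Elo scores quoted in the announcement.
ratings = {
    "Claude Fable 5": 1587,
    "Claude Opus 4.8 (max)": 1356,
    "GLM-5.2 (max)": 1266,
}

fable_vs_opus = elo_expected_score(
    ratings["Claude Fable 5"], ratings["Claude Opus 4.8 (max)"]
)
opus_vs_glm = elo_expected_score(
    ratings["Claude Opus 4.8 (max)"], ratings["GLM-5.2 (max)"]
)

print(f"Fable 5 vs Opus 4.8 expected score: {fable_vs_opus:.2f}")
print(f"Opus 4.8 vs GLM-5.2 expected score: {opus_vs_glm:.2f}")
```

Under that assumption, the 231-point gap at the top corresponds to Claude Fable 5 being preferred in roughly four of five pairwise comparisons against Opus 4.8, while the 90-point gap between Opus 4.8 and GLM-5.2 is a much narrower edge.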

More details below in thread ⬇️

4:51 PM · Jun 18, 2026 · 22.4K Views

Sentiment

Positive users praise GLM-5.2's strong results and the AA-Briefcase benchmark for non-gameable agentic work, while negative users distrust its scoring and mock the high costs paired with low outcomes.

Pos

71.3%

Neg

28.7%

23 comments with sentiment.


Posts from X


Artificial Analysis@ArtificialAnlys

AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work


Artificial Analysis@ArtificialAnlys

Cost per task on AA-Briefcase spans ~800x across models tested, from $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5 (currently unavailable). The top Elo scores come at a steep premium, with Claude Fable 5 and Opus 4.8 leading but also the two most expensive to run. The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost
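The cost ratios in this post can be checked with a few lines of arithmetic. This is a rough sketch using only the per-task dollar figures quoted in the thread; the Claude Fable 5 figure is stated as "over $31" and the DeepSeek V4 Flash figure as approximate, so the computed spread is itself approximate. The variable names are illustrative.

```python
# Average cost per AA-Briefcase task in USD, as quoted in the thread.
cost_per_task_usd = {
    "Claude Fable 5": 31.00,           # quoted as "over $31", so a lower bound
    "Claude Opus 4.8 (max)": 10.40,
    "GPT-5.5 (xhigh)": 3.68,
    "GLM-5.2 (max)": 2.40,
    "DeepSeek V4 Flash (max)": 0.04,   # quoted as "~$0.04"
}

# Spread between the most and least expensive models listed here.
spread = max(cost_per_task_usd.values()) / min(cost_per_task_usd.values())

# GLM-5.2 (max) cost as a fraction of Claude Opus 4.8 (max) cost.
glm_vs_opus = (
    cost_per_task_usd["GLM-5.2 (max)"] / cost_per_task_usd["Claude Opus 4.8 (max)"]
)

print(f"Cost spread: ~{spread:.0f}x")
print(f"GLM-5.2 (max) costs {glm_vs_opus:.0%} of Opus 4.8 (max)")
```

With the rounded inputs the spread comes out a little under 800x, which is consistent with the "~800x" claim given that the Fable 5 figure is a lower bound, and GLM-5.2 (max) lands under the stated 25% of Opus 4.8's cost.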



Artificial Analysis@ArtificialAnlys

Tool use behavior varies by model family. Anthropic and MiniMax models make heavy use of visual inspection, with Claude Fable 5 and Claude Opus 4.8 (max) averaging ~21 and ~12 view image calls per task, respectively. These models lead in both overall Elo and Presentation Elo, suggesting that repeatedly inspecting rendered outputs is an important part of producing strong deliverables


Artificial Analysis@ArtificialAnlys

Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%
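The difference between a per-check rubric pass rate and the strict all-or-nothing score described here can be sketched as follows. The task names and check outcomes below are invented toy data, not AA-Briefcase results, and the helper names are hypothetical; the sketch only illustrates why a model can pass most checks yet rarely achieve a perfect task.

```python
from typing import Dict, List


def rubric_pass_rate(checks: List[bool]) -> float:
    """Fraction of binary rubric checks passed for one task."""
    if not checks:
        raise ValueError("a task needs at least one rubric check")
    return sum(checks) / len(checks)


def strict_task_pass(checks: List[bool]) -> bool:
    """All-or-nothing grading: the task counts only if every check passes."""
    if not checks:
        raise ValueError("a task needs at least one rubric check")
    return all(checks)


# Toy data (hypothetical): rubric check outcomes per task for one model.
toy_results: Dict[str, List[bool]] = {
    "task_a": [True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, True, False],
}

perfect_share = sum(strict_task_pass(c) for c in toy_results.values()) / len(toy_results)
above_half = [name for name, c in toy_results.items() if rubric_pass_rate(c) > 0.5]

print(f"Perfect tasks: {perfect_share:.0%}")
print(f"Tasks above 50% rubric pass rate: {above_half}")
```

In the toy data, two of three tasks clear the 50% rubric threshold but only one is perfect, which mirrors how a strict scheme can report a very low share of fully satisfied tasks even for a leading model.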


Artificial Analysis@ArtificialAnlys

Sample AA-Briefcase outputs from a single commercial due diligence task show how far model quality varies. Claude Fable 5 (currently unavailable) produces a clean and structured market map with footnoted figures and flagged uncertainties; GPT-5.5 (xhigh) creates a relatively detailed document, albeit with some layout issues; Gemini 3.1 Pro generates a very sparse diagram without supplemental analysis


Artificial Analysis@ArtificialAnlys

See our launch article for further details and analysis: http://artificialanalysis.ai/articles/aa-briefcase

Full results on our website: https://artificialanalysis.ai/evaluations/aa-briefcase


Artificial Analysis@ArtificialAnlys

Less capable models most often fail at task execution, missing relevant input files, submitting unusable deliverables, or producing no deliverable at all. More capable models, measured by overall rubric pass rate, more often fail to fulfill all task requirements, including those embedded in the original task or hidden across source files


Artificial Analysis@ArtificialAnlys

For each rubric check, we identify the minimum set of files a model must read to pass. High-performance models (averaging ≥30% rubric pass rate) fall from ~55% on prompt-only checks to ~40% on checks requiring 5+ files
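One way to reproduce this kind of breakdown is to group rubric checks by the number of files required and compute a pass rate per group. The sketch below uses invented toy records and an assumed three-bucket split; the post only names the "prompt-only" and "5+ files" buckets, so the middle bucket and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_for(n_files: int) -> str:
    """Map a required-file count to a bucket label (middle bucket is assumed)."""
    if n_files == 0:
        return "prompt-only"
    if n_files < 5:
        return "1-4 files"
    return "5+ files"


def pass_rate_by_bucket(records: List[Tuple[int, bool]]) -> Dict[str, float]:
    """Compute the pass rate per bucket from (required_files, passed) records."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passed, total]
    for n_files, passed in records:
        bucket = bucket_for(n_files)
        totals[bucket][0] += int(passed)
        totals[bucket][1] += 1
    return {bucket: p / t for bucket, (p, t) in totals.items()}


# Toy data (hypothetical): (minimum files required, did the check pass).
toy_checks = [
    (0, True), (0, True), (0, False),
    (2, True), (2, False),
    (6, False), (7, True), (9, False),
]

rates = pass_rate_by_bucket(toy_checks)
for bucket, rate in rates.items():
    print(f"{bucket}: {rate:.0%}")
```

Real data would replace `toy_checks` with one record per rubric check per model run; the toy records are arranged only so that pass rates decline as the file count grows, as the post describes.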


Artificial Analysis@ArtificialAnlys

Total token usage on AA-Briefcase ranges over 18x across models tested, from 52M for Grok 4.3 to 970M for Claude Fable 5. Excel deliverables drive most of that usage, making up around half of total usage for nearly every model


Artificial Analysis@ArtificialAnlys

On AA-Briefcase, the strongest models naturally tend to take more turns: Opus 4.8 reaches an Elo of 1356 at 52 median turns per task, and Claude Fable 5 a 1587 Elo at 63 median turns per task. However, more turns are not always indicative of a stronger result. For example, Gemini 3.5 Flash typically takes the most turns per task of any model (~78 median) while landing well below the leaders on Elo


Orphis@Anrahya

@ArtificialAnlys GLM 5.2 is cooking all closed models rn, and couldn't benchmaxx this even if they wanted, it's so new


Max Scherf@zwiebelhelm

@ArtificialAnlys All the top-performers relying on vision and glm 5.2 being at the frontier without any multi-modality is a huge win!


Micah Hill-Smith@_micah_h

@emollick Glad you're excited about it @emollick! I think you'll enjoy the results explorer for the public scenario


Philip C@Aknotymous

@ArtificialAnlys This matches my own experience quite well, and why I have been frustrated with GPT 5.5 for most knowledge work tasks and keep defaulting back to Claude (despite how much I love the Codex app). Great work with this benchmark.


Grok@grok

@Matthewwa25 @emollick Decentralized development and strong open-weight models like GLM-5.2 are delivering real results in complex benchmarks. That kind of distributed capability is inherently harder for any single authority to control or gatekeep.
