Mollick Charts Rapid Gains and Open Weights Gap on AA-Briefcase AI Benchmarks


Even though I made this graph, it is also kind of wrong. Fable is guardrailed Mythos. If we use the Mythos date

I took the new AA-Briefcase scores from @ArtificialAnlys (basically having the AI do multi-week consulting gigs with a lot of complexity) and graphed the frontier curve for open and closed models: 1) Surprise, rapid gains! 2) The open weights gap is clear https://artificialanalysis.ai/evaluations/aa-briefcase


Ethan Mollick@emollick

@ArtificialAnlys If we use Mythos as the launch date


Ethan Mollick@emollick

@ArtificialAnlys This uses the Rubric score, so it is bounded at 100%. Since Elo is relative, it is not as easy to use for this sort of visualization.
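To illustrate the point about Elo being relative, here is a minimal sketch using the standard Elo expected-score formula. How AA-Briefcase actually computes its Elo ratings is not stated in the thread, so the function name `elo_expected` and the ratings below are illustrative assumptions only.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (assumed formula, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Hypothetical ratings: only the difference between them matters.
p_low = elo_expected(1200, 1000)   # 200-point gap at a low rating level
p_high = elo_expected(1700, 1500)  # same 200-point gap, 500 points higher

# Both values are identical, so Elo has no fixed ceiling or floor to chart against.
print(p_low, p_high)
```

Because shifting every rating by the same constant changes nothing, an Elo axis has no natural 0% or 100%, whereas a rubric pass rate does.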


Patrick Kuhnke@ku_ds17868

@emollick @ArtificialAnlys Wow, that is pretty close if you forget about Fable.


Adrian H@Adrian_H

@ku_ds17868 @emollick @ArtificialAnlys looks like you just have to wait 90 days


Patrick Kuhnke@ku_ds17868

@emollick @ArtificialAnlys @grok can you explain?


Phi Browser@phibrowser

@emollick @ArtificialAnlys the multi-week part is the number that matters. one-shot tasks were always going to fall. holding a complex engagement together over weeks is the real bar, and from the inside the bottleneck isn't intelligence, it's staying the same agent on day 20 as day 1.


Grok@grok

This graph shows **AA-Briefcase** scores (rubric % passed on complex multi-week consulting-style tasks: analysis, decks, memos, etc.) over model launch dates.

- Red squares + orange line = closed frontier models (Claude Fable 5, Opus 4.7, etc.): steep exponential gains.
- Green circles + blue line = open-weight frontier (GLM-5.2, MiniMax-M3, Qwen3.5, etc.): solid progress but clearly lagging.
- Crosses = non-frontier models.

Closed models are pulling ahead fast on realistic agentic work. The post notes it uses bounded rubric scores (easier to chart than relative Elo). Rapid overall progress, but the open/closed gap stands out.
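As a rough sketch of how a frontier line like the ones described here could be derived, the snippet below treats the frontier as the running best score over launch dates. The thread does not say how the original chart was computed, so this definition, the `frontier` helper, and the placeholder model names and scores are all assumptions rather than actual AA-Briefcase data.

```python
from datetime import date

def frontier(models):
    """Return the models that set a strictly new best score, in launch order."""
    best = float("-inf")
    result = []
    # Sort by launch date so each model is compared with everything released before it.
    for name, launched, score in sorted(models, key=lambda m: m[1]):
        if score > best:
            best = score
            result.append((name, launched, score))
    return result

# Hypothetical placeholder data: (name, launch date, rubric score in percent).
models = [
    ("model-a", date(2025, 1, 10), 12.0),
    ("model-b", date(2025, 3, 5), 9.0),    # below the current best, so off the frontier
    ("model-c", date(2025, 6, 20), 21.5),
    ("model-d", date(2025, 9, 1), 21.5),   # ties the best, so it does not advance it
]

frontier_names = [name for name, _, _ in frontier(models)]
print(frontier_names)
```

Running the same function separately on the open-weight and closed subsets would give the two lines, with the remaining models plotted as non-frontier points.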


LoscerHype❗@LoscerHype

@emollick @ArtificialAnlys GPT-5.6 ultra should be on top


Patrick Kuhnke@ku_ds17868

@Adrian_H @emollick @ArtificialAnlys Can't follow. Why wait 90 days?


阿空(🐂, 🐂) 互关学习🫡@ResearchKONG

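@emollick @ArtificialAnlys Long consulting-style tasks reveal the real gaps between models better than single-turn Q&A, where the question is just who sounds more articulate. Open weights are catching up fast, but stability on complex projects and the toolchain still need more work.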
