I took the new AA-Briefcase scores from @ArtificialAnlys (basically having the AI do multi-week consulting gigs with a lot of complexity) and graphed the frontier curve for open and closed models: 1) Surprise, rapid gains! 2) The open weights gap is clear https://artificialanalysis.ai/evaluations/aa-briefcase
Mollick Charts Rapid Gains and Open Weights Gap on AA-Briefcase AI Benchmarks
Even though I made this graph, it is also kind of wrong. Fable is guardrailed Mythos. If we use the Mythos date
@ArtificialAnlys If we use Mythos as the launch date
@ArtificialAnlys This uses the Rubric score, so it is bounded at 100%. Since Elo is relative, it is not as easy to use for this sort of visualization.

@emollick @ArtificialAnlys wow, that is pretty close if you forget about Fable.

@ku_ds17868 @emollick @ArtificialAnlys looks like you just have to wait 90 days

@emollick @ArtificialAnlys @grok can you explain.

@emollick @ArtificialAnlys the multi-week part is the number that matters. one-shot tasks were always going to fall. holding a complex engagement together over weeks is the real bar, and from the inside the bottleneck isn't intelligence, it's staying the same agent on day 20 as day 1.

This graph shows **AA-Briefcase** scores (rubric % passed on complex multi-week consulting-style tasks: analysis, decks, memos, etc.) over model launch dates.
- Red squares + orange line = closed frontier models (Claude Fable 5, Opus 4.7, etc.): steep exponential gains.
- Green circles + blue line = open-weight frontier (GLM-5.2, MiniMax-M3, Qwen3.5, etc.): solid progress but clearly lagging.
- Crosses = non-frontier models.
Closed models are pulling ahead fast on realistic agentic work. The post notes it uses bounded rubric scores (easier to chart than relative Elo). Rapid overall progress, but the open/closed gap stands out.
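
For anyone who wants to rebuild this kind of chart, here is a minimal sketch of how a frontier line could be derived, assuming "frontier" means the best score seen so far at each launch date. The function name `frontier` and all data points are hypothetical placeholders, not real AA-Briefcase scores, and this is not necessarily how the original graph was made.

```python
from datetime import date

def frontier(points):
    """Keep only the models that set a new best score, in launch-date order.

    `points` is a list of (launch_date, name, score) tuples.
    """
    best = float("-inf")
    result = []
    for launch, name, score in sorted(points):
        if score > best:  # a new high-water mark extends the frontier
            best = score
            result.append((launch, name, score))
    return result

# Hypothetical placeholder data, NOT real benchmark results.
closed_models = [
    (date(2025, 1, 1), "closed-a", 20.0),
    (date(2025, 4, 1), "closed-b", 15.0),  # below the running best, so non-frontier
    (date(2025, 7, 1), "closed-c", 45.0),
]
open_models = [
    (date(2025, 2, 1), "open-a", 10.0),
    (date(2025, 8, 1), "open-b", 30.0),
]

closed_frontier = frontier(closed_models)
open_frontier = frontier(open_models)

for label, line in (("closed", closed_frontier), ("open", open_frontier)):
    for launch, name, score in line:
        print(f"{label:6s} {launch.isoformat()} {name:10s} {score:5.1f}%")
```

Models dropped by `frontier` would be the "crosses" on the chart. Because rubric scores are capped at 100%, the two frontier lines can be put on one shared axis without rescaling.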

@emollick @ArtificialAnlys GPT-5.6 ultra should be on top

@Adrian_H @emollick @ArtificialAnlys Can't follow. Why wait 90 days?

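@emollick @ArtificialAnlys Long consulting-style tasks show the real gaps between models better than single-turn Q&A, where it only comes down to who talks better. Open weights are catching up fast, but stability on complex projects and the toolchain still need work.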