Jeff Dean and Andy Konwinski launch ALE, an agent benchmark finding frontier models score 0% on complex professional tasks

The evaluation tested GPT-5.5, Fable 5, and Composer 2.5.

Original post

Dawn Song@dawnsongtweets

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case?

Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work.

My group and collaborators have previously created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym. Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains.

With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations. The result is both impressive and sobering.

Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance.

On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate. The age of useful agents is here.

The age of truly job-ready agents is not.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

8:36 AM · Jun 11, 2026 · 234.1K Views

Sentiment

Many users praised the Agents' Last Exam benchmark for empirically exposing frontier AI agents' failures on complex real-world tasks, while some dismissed job-replacement hype as unrealistic.

Positive: 76.9%

Negative: 23.1%

13 comments with sentiment.

Posts from X

Most Activity

Outdated Often@JamesSurra34

@dawnsongtweets Whoever is saying agents will be ready to take jobs soon is dreaming. The harness and the model are nowhere near being able to complete unstructured tasks autonomously. Maybe in this generation, but not anytime soon.

6d

Bookmarks: 1

There's no "I" in craftmanship@6851cf3c

@GuangyuRobert @dawnsongtweets From the tasks I've seen it seems like they're presented with industry standard tools in the descriptions.

We have decades of (software) tools that people use daily, and you can't really put an HTTP server and JSON codec in all of them and call it MCP.

6d

Likes: 3

Canopy Wave@CanopyWave_AI

@dawnsongtweets The fact that ALE-CLI is already harder than SWE-bench Pro and Terminal-Bench says everything. Great to see frontier models doing meaningful work on mid-tier tasks, but the long-horizon expert tier is clearly where the real challenge lies.

6d

Replies: 2

Ler Khachatrian@2034_SWE

@dawnsongtweets Is the benchmark open source? I'd like to take a crack at applying my private Harness/Environment/Orchestration engineering systems to it.

I'm confident that, with a few iterations, today’s SOTA models may be able to hit 90-100%.

6d

Jesus Bosch 🚀@Jbosch_

@dawnsongtweets I think your approach is wrong. We don’t need AGI-like agents. Agents good at specific jobs already exist, they are already working, and they will only improve

6d

Adel Bucetta@adelbucetta

@dawnsongtweets but job readiness has nothing to do with human competence, it's about automatable tasks

7d

Shubham Sharma | AI & Tech@editxshub

@dawnsongtweets How many of the tasks did Fable 5 refuse to do? (fwiw, marking a refusal as zero is the correct methodology, but the lower score would be surprising otherwise)

6d

Jeff Steve@JeffSte17327059

@Tigger0000 @dawnsongtweets @grok Composer 2.5 is the AI built by Cursor in partnership with Elon Musk's xAI: a new set of models based on Kimi K2.6 but further trained on Cursor's data in the Grok datacenter.

6d

Niraj Tulsyan@NirajTulsyan

@dawnsongtweets Thanks for this test. It proves my point that humans are better long term.

6d

Yiyou Sun@YiyouSun

@BradSpahn @dawnsongtweets "We only look bad because we throttled ourselves" is an excuse any vendor could use for any benchmark. We wrote up exactly where Fable 5 falls short (mostly not following simple instructions + calling unverified work "done"): https://agents-last-exam.org/blogs/agent-showdown.

6d

Jeff Steve@JeffSte17327059

@Tigger0000 @dawnsongtweets @grok No idea why Grok didn't reply to you, bro. RIP.

6d

Solgato@Tigger0000

@JeffSte17327059 @dawnsongtweets @grok interesting, there's much more happening with grok than i had any idea about until recently :D thanks!

6d

Kirill Balakhonov@balakhonoff

@dawnsongtweets @scaling01 Where is the comparison with humans on the charts?

6d

Grounded DI LLC@Grounded_DI

@dawnsongtweets We’ve been wanting to try a test like this for a while. Can we?

6d

Adrían Bridgwater✍⌨️@ABridgwater

@dawnsongtweets Hi Dawn... I just reached out to you on LinkedIn and tried emailing you at your work email address. I might like to write this story up. Can you connect on LinkedIn or reply by email? Adrian

6d

Clark@clark__labs

@dawnsongtweets amazing work!

working on getting to 100% on a similar benchmark from day 1.

6d

Pok@Pok30305202

@dawnsongtweets Great! But does it have Opus 4.6 results? Fable 5 was a waste of money and time; I only received downgraded responses, like a scam.

6d
