Dawn Song releases Agents' Last Exam, a benchmark where top AI agents score just 2.6% on the hardest professional tasks

Story Overview

Dawn Song and a large team introduce Agents' Last Exam, a living benchmark built from more than 1,500 real professional tasks spanning 55 subdomains. Agents receive full GUI and CLI access on actual machines, and scoring relies on verifiable outputs rather than human judgment. On the hardest tier, even leading models clear just 2.6 percent of the items, highlighting the distance between current systems and economically valuable work.
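To make "verifiable outputs rather than human judgment" concrete, here is a minimal sketch of outcome-based grading in Python. It is not ALE's actual harness: the `Task` record, the `check_report` checker, the `report.json` file name, the expected value, and the tier labels are all hypothetical and chosen only for illustration. The grader never looks at how the agent worked; it only inspects the files left in the workspace, then aggregates pass rates per tier.

```python
import json
import tempfile
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; not the benchmark's real schema."""
    task_id: str
    tier: str  # illustrative tier label, e.g. "standard" or "last-exam"
    check: Callable[[Path], bool]  # inspects the workspace the agent left behind


def check_report(workspace: Path) -> bool:
    """Pass only if report.json exists, parses as JSON, and holds the expected total."""
    report = workspace / "report.json"
    if not report.is_file():
        return False
    try:
        data = json.loads(report.read_text())
    except json.JSONDecodeError:
        return False
    # The expected value 42 is an assumption made up for this sketch.
    return isinstance(data, dict) and data.get("total") == 42


def pass_rates(results: list[tuple[Task, bool]]) -> dict[str, float]:
    """Return the fraction of passed tasks for each tier."""
    passed: dict[str, int] = defaultdict(int)
    seen: dict[str, int] = defaultdict(int)
    for task, ok in results:
        seen[task.tier] += 1
        passed[task.tier] += int(ok)
    return {tier: passed[tier] / seen[tier] for tier in seen}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        task = Task("demo-001", "last-exam", check_report)
        # Stand-in for an agent run: here we simply write the expected file.
        (workspace / "report.json").write_text(json.dumps({"total": 42}))
        print(pass_rates([(task, task.check(workspace))]))
```

Because the check is a plain function of the final workspace state, rerunning it on the same output always gives the same verdict, which is the property that lets a benchmark drop human judges.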

Original post · Pan Lu #1232
Yiyou Sun @YiyouSun

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

10:01 AM · Jun 9, 2026 · 33.8K Views
Benchmark Gap

Why prior benchmarks miss the mark

Existing tests cover far fewer domains and rarely require long-horizon execution inside real tools. ALE draws tasks directly from shipped industry projects and aligns them to federal occupational data, exposing gaps that saturated leaderboards leave hidden.

Open Question

What the 2.6 percent figure leaves open

The benchmark is designed to grow, with a target of 5,000 tasks and ongoing contributions from experts. It remains unclear how quickly agent performance will rise or whether the hardest tier will stay a reliable signal once models train directly against it.

Sentiment

Many users praised Agents' Last Exam as a major step toward evaluating how AI agents perform on real jobs, while some noted that it does not capture the messiness of production systems or the role of human judgment.

Positive: 91.7% · Negative: 8.3% (10 comments with sentiment)
Posts from X
Weiyan Shi @shi_weiyan

vividly remember the excitement when I first heard about this – 📑Agents' Last Exam📑 is finally out:
- 55 industries including manufacturing, architecture
- 1500+ tasks with expert workflows
- 2.6% pass rate

Still a long way to go on agents, but now we have the right exam! Amazing effort by @YiyouSun and team!

6h · Views 8.3K · Likes 41 · Bookmarks 20
Likes 47 · Retweets 10 · Replies 4
Dawn Song @dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on (MMLU, MATH, CyberGym, ExploitGym, etc.). I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar to clear — passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession.

"Last" as the frontier of difficulty — tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks comes from real projects or research contributed by domain experts. We converted them into verifiable tests and objectively graded evaluations — no human judges required.

• Built for breadth. ALE spans 55 non-physical occupations based on the O*NET / SOC 2018 occupational taxonomy, with contributions from 300+ experts across 100+ institutions.

• Judged on results, no restriction on process. We evaluate Generalist Computer-Use Agents (GCUAs) with full GUI + CLI access, allowing them to solve tasks however they choose: clicking, typing, scripting, browsing, and more. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for spearheading this tremendous effort, and to our esteemed advisory committee, incredible team and collaborators who made it possible.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵👇

4h · Views 7.8K · Likes 47 · Bookmarks 15
Snorkel AI @SnorkelAI

We're proud Snorkel AI is part of Agents' Last Exam, with our researchers @amanda_dsouza and @vincentsunnchen among the co-authors and support from our Open Benchmarks Grants initiative.

The forecast: agents will do almost every job by 2027. The result on real, code-graded work? Top agents pass just 2.6% on the hardest tier.

Excited to keep pushing this forward with @YiyouSun, @dawnsongtweets and the @BerkeleyRDI team. 👇

8h · Views 5.5K · Likes 42 · Bookmarks 16
Pan Lu @lupantech

Excited to see Agents' Last Exam (ALE) out! https://agents-last-exam.org/

As AI agents move toward real-world work, we need rigorous benchmarks to measure their capabilities, limitations, and broader societal and labor-market impact.

ALE is an important step toward grounding the discussion in realistic, code-graded, labor-market-aligned tasks. 👇

#ALE #AgentsLastExam #Agents

5h · Views 499 · Likes 9 · Bookmarks 1
Dawn Song @dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on (MMLU, MATH, CyberGym, ExploitGym, etc.). I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar you have to clear — passing these exams means an agent can actually do the job and keep doing valuable work in that profession.

"Last" as the difficulty of the tasks: they are real, long, and require professional knowledge to execute. ALE sits right at the boundary of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks was sourced from human experts’ past projects or research. We turned them into verifiable tests, scored deterministically. No human judges.

• Built for breadth. 55 non-physical industries, grounded in the O*NET / SOC 2018 federal occupational taxonomy, assembled by 300+ experts across 100+ institutions.

• Judged on results, not method. We give a Generalist Computer-Use Agent (GCUA) full GUI + CLI access and let it solve tasks however it would — click, type, script, browse. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for leading this massive effort, and our amazing team! The dataset and leaderboard are open. 🧵👇

8h · Views 520 · Likes 8 · Bookmarks 1
Yiyou Sun @YiyouSun

1/ Where do the tasks come from?

Every task is a real project that a human expert has already shipped, turned into a code-graded test.

No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Built by 300+ experts across 100+ institutions.

8h · Views 117 · Likes 6
Yiyou Sun @YiyouSun

6/ Come test your agents on ALE →
Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

8h · Views 84 · Likes 6
Yiyou Sun @YiyouSun

2/ Which agent is leading?

Current podium (harness + flagship model):

🥇 Codex (gpt-5-5)
🥈 Cursor (composer-2-5)
🥉 Claude Code (opus-4-8)

See more at https://agents-last-exam.org/leaderboard.

Beyond the leaderboard, where do top-performing agents perform differently? Full analysis coming soon https://agents-last-exam.org/blog.

8h · Views 118 · Likes 5
Yiyou Sun @YiyouSun

5/ What kind of agent are we focusing on? We equip the Generalist Computer-Use Agent (GCUA) with full GUI and CLI access. We don't constrain how the agent solves a task. Whatever a human could do on a computer, the agent is free to do: click, type, script, browse, automate.

It's judged on the result, not the method.

8h · Views 81 · Likes 5
Yiyou Sun @YiyouSun

3/ How does ALE compare to existing agent benchmarks?

Today's agent benchmarks are getting saturated fast. ALE sits in a different corner of the map:

• 55 industry domains
• 1,500+ tasks
• Tasks spanning both GUI and CLI

Top-tier agents pass just 26% overall, and only 2.6% on the Last-Exam tier.

Only have a CLI agent? That's fine. We ship ALE-CLI, the terminal-only subset of ALE.

8h · Views 80 · Likes 5
Yiyou Sun @YiyouSun

4/ Why do we call it "Last Exam"?

Because the day agents saturate ALE is the day they can actually power real industries.

That day is not today. But it's the one worth measuring and optimizing toward.

8h · Views 66 · Likes 5
Yiyou Sun @YiyouSun

7/ We are proud to have a distinguished advisory committee @yannakakis, @gallantlab, @thg_lab, @yaminirangan, Tapio Schneider, Laure Zanna, @Idasim, Arvind Rao, @brad_rothenberg, @kaanozbay, Tarek Zohdi, Georgios Yannakakis, Carl Boettiger, @ksteinfe, Patrick Bryant guiding our industry landscape and task collection, and are deeply grateful for the compute and API support from @BerkeleyRDI, RDI Foundation @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program), as well as the dedication of our amazing organizing and execution team, along with all other data contributors to the ALE benchmark. This would not have happened without you!

8h · Views 73 · Likes 5
Martin Kemka @mkemka_

@YiyouSun Good timing with fable releasing

8h · Views 71
Zengyi Qin @qinzytech

@YiyouSun huge step towards generalist agentic eval!

8h · Views 125 · Likes 1
Lucas @lu_shuo_

@YiyouSun great work!!!

8h · Views 28 · Likes 1
Yiyou Sun @YiyouSun

@mkemka_ 😵 Testing it right now.

7h · Views 63
Sean Wu @sean_n_wu

@lupantech Great work Pan!

3h · Views 13 · Likes 1
Aaliya @aaliya_va

@YiyouSun Real job tasks are much harder than simple tests.

7h · Views 37
Suresh @_Suresh2

@SnorkelAI @amanda_dsouza @vincentsunnchen 2.6% on hardest, but we once got a 4% bump from test cases leaking into the prompt; code-graded evals are tricky

7h · Views 19
Adel Bucetta @adelbucetta

@YiyouSun the real unlock isn't making ai better at tasks, it's defining what tasks matter in the first place

1h · Views 9