Tech · 1d ago

Dawn Song releases Agents' Last Exam, a benchmark where top AI agents score just 2.6% on the hardest professional tasks

Story Overview

Dawn Song and a large team introduce Agents' Last Exam, a living benchmark built from more than 1,500 real professional tasks spanning 55 subdomains. Agents receive full GUI and CLI access on actual machines, and scoring relies on verifiable outputs rather than human judgment. On the hardest tier, even leading models clear just 2.6 percent of the items, highlighting the distance between current systems and economically valuable work.
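To make "verifiable outputs rather than human judgment" concrete, here is a minimal Python sketch of outcome-based grading. It is not ALE's actual harness. The `Task` record, the tier label, the `report.json` file name, and the expected total are all hypothetical, chosen only to show a deterministic check on what an agent leaves behind and how a per-tier pass rate could be computed from such checks.

```python
import json
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; ALE's real schema may differ."""
    task_id: str
    tier: str                      # illustrative tier label, e.g. "last-exam"
    check: Callable[[Path], bool]  # deterministic check on the agent's output files


def check_invoice_total(workdir: Path) -> bool:
    """Pass only if the agent left a report.json whose "total" equals 1250.0."""
    report = workdir / "report.json"
    if not report.is_file():
        return False
    try:
        data = json.loads(report.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("total") == 1250.0


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks whose check returned True (0.0 for an empty list)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def pass_rate_by_tier(tasks: list[Task], workdir: Path) -> dict[str, float]:
    """Run every check against the same output directory and group by tier."""
    by_tier: dict[str, list[bool]] = defaultdict(list)
    for task in tasks:
        by_tier[task.tier].append(task.check(workdir))
    return {tier: pass_rate(results) for tier, results in by_tier.items()}
```

The check inspects only the final artifact, so it does not matter whether the agent produced the file by clicking through a GUI or by running a script. The same inputs always yield the same verdict, which is what makes the score reproducible without a human or LLM judge.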

Original post
Yiyou Sun @YiyouSun

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

10:01 AM · Jun 9, 2026 · 68.4K Views
Benchmark Gap

Why prior benchmarks miss the mark

Existing tests cover far fewer domains and rarely require long-horizon execution inside real tools. ALE draws tasks directly from shipped industry projects and aligns them to federal occupational data, exposing gaps that saturated leaderboards leave hidden.

Open Question

What the 2.6 percent figure leaves open

The benchmark is designed to grow, with a target of 5,000 tasks and ongoing contributions from experts. It remains unclear how quickly agent performance will rise or whether the hardest tier will stay a reliable signal once models train directly against it.

Sentiment

Commenters praise Agents' Last Exam for testing AI agents realistically on hard real-world tasks and for clarifying what matters in generalist evaluation.

Pos: 100.0%
Neg: 0.0%
7 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 16.7K · Bookmarks 42 · Likes 84 · Retweets 16 · Replies 5
Dawn Song @dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on: MMLU, MATH, CyberGym, ExploitGym, etc. I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar to clear — passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession.

"Last" as the frontier of difficulty — tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks comes from real projects or research contributed by domain experts. We converted them into verifiable tests and objectively graded evaluations — no human judges required.

• Built for breadth. ALE spans 55 non-physical occupations based on the O*NET / SOC 2018 occupational taxonomy, with contributions from 300+ experts across 100+ institutions.

• Judged on results, no restriction on process. We evaluate Generalist Computer-Use Agents (GCUAs) with full GUI + CLI access, allowing them to solve tasks however they choose: clicking, typing, scripting, browsing, and more. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for spearheading this tremendous effort, and to our esteemed advisory committee, incredible team and collaborators who made it possible.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵👇

1d · Views 16.7K · Likes 84 · Bookmarks 42
Weiyan Shi @shi_weiyan

vividly remember the excitement when I first heard about this – 📑Agents' Last Exam📑 is finally out:
- 55 industries including manufacturing, architecture
- 1500+ tasks with expert workflows
- 2.6% pass rate

Still a long way to go on agents, but now we've the right exam! Amazing effort by @YiyouSun and team!

1d · Views 12.9K · Likes 56 · Bookmarks 33
Snorkel AI @SnorkelAI

We're proud Snorkel AI is part of Agents' Last Exam, with our researchers @amanda_dsouza and @vincentsunnchen among the co-authors and support from our Open Benchmarks Grants initiative.

The forecast: agents will do almost every job by 2027. The result on real, code-graded work? Top agents pass just 2.6% on the hardest tier.

Excited to keep pushing this forward with @YiyouSun, @dawnsongtweets and the @BerkeleyRDI team. 👇

1d · Views 7.2K · Likes 53 · Bookmarks 23
Pan Lu @lupantech

Excited to see Agents' Last Exam (ALE) out! https://agents-last-exam.org/

As AI agents move toward real-world work, we need rigorous benchmarks to measure their capabilities, limitations, and broader societal and labor-market impact.

ALE is an important step toward grounding the discussion in realistic, code-graded, labor-market-aligned tasks. 👇

#ALE #AgentsLastExam #Agents

1d · Views 1.5K · Likes 13 · Bookmarks 3
vincent sunn chen @vincentsunnchen

Agents' Last Exam covers 1,490 domain-specific environments/tasks across 55 industries, with a focus on:
- realistic, domain-specific environments and tasks (e.g. SolidWorks or Rhino for architecture)
- verification via deterministic rubrics rather than an LLM judge
- new coverage of 13/55 previously uncovered domains

Work led by @YiyouSun @Xinyang_Han_ @dawnsongtweets & the @BerkeleyRDI team. We at @SnorkelAI are glad to collaborate on this benchmark to measure economically-valuable work

1d · Views 992 · Likes 14 · Bookmarks 1
Dawn Song @dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on: MMLU, MATH, CyberGym, ExploitGym, etc. I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar you have to clear — passing these exams means an agent can actually do the job and keep doing valuable work in that profession.

"Last" as the difficulty of the tasks: they are real, long, and need professional knowledge to execute. ALE sits right at the boundary of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks was sourced from human experts’ past projects or research. We turned them into verifiable tests, scored deterministically. No human judges.

• Built for breadth. 55 non-physical industries, grounded in the O*NET / SOC 2018 federal occupational taxonomy, assembled by 300+ experts across 100+ institutions.

• Judged on results, not method. We give a Generalist Computer-Use Agent (GCUA) full GUI + CLI access and let it solve tasks however it chooses: click, type, script, browse. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for leading this massive effort, and our amazing team! The dataset and leaderboard are open. 🧵👇

1d · Views 520 · Likes 8 · Bookmarks 1
Yiyou Sun @YiyouSun

1/ Where do the tasks come from?

Every task is a real project that a human expert has already shipped, turned into a code-graded test.

No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Built by 300+ experts across 100+ institutions.

1d · Views 117 · Likes 6
Yiyou Sun @YiyouSun

6/ Come test your agents on ALE →
Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

1d · Views 84 · Likes 6
Yiyou Sun @YiyouSun

2/ Which agent is leading?

Current podium (harness + flagship model):

🥇 Codex (gpt-5-5)
🥈 Cursor (composer-2-5)
🥉 Claude Code (opus-4-8)

See more at https://agents-last-exam.org/leaderboard.

Beyond the leaderboard, where do top-performing agents perform differently? Full analysis coming soon at https://agents-last-exam.org/blog.

1d · Views 118 · Likes 5
Yiyou Sun @YiyouSun

5/ What kind of agent are we focusing on? We equip the Generalist Computer-Use Agent (GCUA) with full access, GUI, and CLI. We don't constrain how the agent solves a task. Whatever a human could do on a computer, the agent is free to do: click, type, script, browse, automate.

It's judged on the result, not the method.

1d · Views 81 · Likes 5
Yiyou Sun @YiyouSun

3/ How does ALE compare to existing agent benchmarks?

Today's agent benchmarks are getting saturated fast. ALE sits in a different corner of the map:

• 55 industry domains
• 1,500+ tasks
• Tasks spanning both GUI and CLI

Top-tier agents pass just 26% overall, and only 2.6% on the Last-Exam tier.

Only have a CLI agent? That's fine. We ship ALE-CLI, the terminal-only subset of ALE.

1d · Views 80 · Likes 5
Yiyou Sun @YiyouSun

4/ Why do we call it "Last Exam"?

Because the day agents saturate ALE is the day they can actually power real industries.

That day is not today. But it's the one worth measuring and optimizing toward.

1d · Views 66 · Likes 5
Yiyou Sun @YiyouSun

7/ We are proud to have a distinguished advisory committee @yannakakis, @gallantlab, @thg_lab, @yaminirangan, Tapio Schneider, Laure Zanna, @Idasim, Arvind Rao, @brad_rothenberg, @kaanozbay, Tarek Zohdi, Georgios Yannakakis, Carl Boettiger, @ksteinfe, Patrick Bryant guiding our industry landscape and task collection, and are deeply grateful for the compute and API support from @BerkeleyRDI, RDI Foundation @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program), as well as the dedication of our amazing organizing and execution team, along with all other data contributors to the ALE benchmark. This would not have happened without you!

1d · Views 73 · Likes 5
Martin Kemka @mkemka_

@YiyouSun Good timing with fable releasing

1d · Views 71
Zengyi Qin @qinzytech

@YiyouSun huge step towards generalist agentic eval!

1d · Views 125 · Likes 1
Lucas @lu_shuo_

@YiyouSun great work!!!

1d · Views 28 · Likes 1
Yiyou Sun @YiyouSun

@mkemka_ 😵 Testing it right now.

1d · Views 63
Sean Wu @sean_n_wu

@lupantech Great work Pan!

1d · Views 13 · Likes 1
Aaliya @aaliya_va

@YiyouSun Real job tasks are much harder than simple tests.

1d · Views 37
Suresh @_Suresh2

@SnorkelAI @amanda_dsouza @vincentsunnchen 2.6% on hardest, but we once got a 4% bump from test cases leaking into the prompt; code-graded evals are tricky

1d · Views 19