Dawn Song releases Agents' Last Exam, a benchmark where top AI agents score just 2.6% on the hardest professional tasks

Story Overview

Dawn Song and a large team introduce Agents' Last Exam, a living benchmark built from more than 1,500 real professional tasks spanning 55 subdomains. Agents receive full GUI and CLI access on actual machines, and scoring relies on verifiable outputs rather than human judgment. On the hardest tier, even leading models clear just 2.6 percent of the items, highlighting the distance between current systems and economically valuable work.

Original post

Yiyou Sun@YiyouSun

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

10:01 AM · Jun 9, 2026 · 68.4K Views

Benchmark Gap

Why prior benchmarks miss the mark

Existing tests cover far fewer domains and rarely require long-horizon execution inside real tools. ALE draws tasks directly from shipped industry projects and aligns them to federal occupational data, exposing gaps that saturated leaderboards leave hidden.

Open Question

What the 2.6 percent figure leaves open

The benchmark is designed to grow, with a target of 5,000 tasks and ongoing contributions from experts. It remains unclear how quickly agent performance will rise or whether the hardest tier will stay a reliable signal once models train directly against it.

Sentiment

Commenters praise the new Agents' Last Exam benchmark for realistically testing AI agents on hard real-world tasks and for defining what matters in generalist evaluation.

Positive: 100.0% · Negative: 0.0% (7 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 16.7K · Bookmarks: 42 · Likes: 84 · Retweets: 16 · Replies: 5

Dawn Song@dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on: MMLU, MATH, CyberGym, ExploitGym, etc. I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar to clear — passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession.

"Last" as the frontier of difficulty — tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks comes from real projects or research contributed by domain experts. We converted them into verifiable tests and objectively graded evaluations — no human judges required.

• Built for breadth. ALE spans 55 non-physical occupations based on the O*NET / SOC 2018 occupational taxonomy, with contributions from 300+ experts across 100+ institutions.

• Judged on results, no restriction on process. We evaluate Generalist Computer-Use Agents (GCUAs) with full GUI + CLI access, allowing them to solve tasks however they choose: clicking, typing, scripting, browsing, and more. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for spearheading this tremendous effort, and to our esteemed advisory committee, incredible team and collaborators who made it possible.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵👇

Weiyan Shi@shi_weiyan

vividly remember the excitement when I first heard about this – 📑Agents' Last Exam📑 is finally out:
- 55 industries including manufacturing, architecture
- 1500+ tasks with expert workflows
- 2.6% pass rate

Still a long way to go on agents, but now we have the right exam! Amazing effort by @YiyouSun and team!

Snorkel AI@SnorkelAI

We're proud Snorkel AI is part of Agents' Last Exam, with our researchers @amanda_dsouza and @vincentsunnchen among the co-authors and support from our Open Benchmarks Grants initiative.

The forecast: agents will do almost every job by 2027. The result on real, code-graded work? Top agents pass just 2.6% on the hardest tier.

Excited to keep pushing this forward with @YiyouSun, @dawnsongtweets and the @BerkeleyRDI team. 👇

Pan Lu@lupantech

Excited to see Agents' Last Exam (ALE) out! https://agents-last-exam.org/

As AI agents move toward real-world work, we need rigorous benchmarks to measure their capabilities, limitations, and broader societal and labor-market impact.

ALE is an important step toward grounding the discussion in realistic, code-graded, labor-market-aligned tasks. 👇

#ALE #AgentsLastExam #Agents

vincent sunn chen@vincentsunnchen

Agents' Last Exam covers 1,490 domain-specific environments/tasks across 55 industries, with a focus on:
- realistic, domain-specific environments and tasks (e.g. SolidWorks or Rhino for architecture)
- verification via deterministic rubrics rather than an LLM judge
- new coverage of 13/55 previously uncovered domains

Work led by @YiyouSun @Xinyang_Han_ @dawnsongtweets & the @BerkeleyRDI team. We at @SnorkelAI are glad to collaborate on this benchmark to measure economically valuable work.

Dawn Song@dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on: MMLU, MATH, CyberGym, ExploitGym, etc. I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings: "Last" as the bar you have to clear — passing these exams means an agent can actually do the job and keep doing valuable work in that profession.

"Last" as the difficulty of the tasks: tasks are real, long, and need professional knowledge to execute. ALE sits right at the boundary of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks was sourced from human experts’ past projects or research. We turned them into verifiable tests, scored deterministically. No human judges.

• Built for breadth. 55 non-physical industries, grounded in the O*NET / SOC 2018 federal occupational taxonomy, assembled by 300+ experts across 100+ institutions.

• Judged on results, not method. We give a Generalist Computer-Use Agent (GCUA) full GUI + CLI access and let it solve tasks however it would — click, type, script, browse. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for leading this massive effort, and our amazing team! The dataset and leaderboard are open. 🧵👇

Yiyou Sun@YiyouSun

1/ Where do the tasks come from?

Every task is a real project that a human expert has already shipped, turned into a code-graded test.

No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Built by 300+ experts across 100+ institutions.
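A minimal sketch of what a code-graded, outcome-only check could look like, assuming a task that asks the agent to write a JSON report. The function name `grade_outcome`, the file name, and the expected values are hypothetical illustrations and are not taken from ALE's actual harness.

```python
import json
import tempfile
from pathlib import Path


def grade_outcome(output_path, expected, tol=1e-6):
    """Return True only if the produced JSON file matches the expected values.

    Hypothetical grader: it inspects the final artifact and never looks at
    how the agent produced it (clicks, scripts, browsing, and so on).
    """
    if not output_path.exists():
        return False
    try:
        produced = json.loads(output_path.read_text())
    except json.JSONDecodeError:
        return False
    if not isinstance(produced, dict):
        return False
    for key, want in expected.items():
        got = produced.get(key)
        if isinstance(want, float):
            # Numeric fields are compared with a small tolerance.
            is_number = isinstance(got, (int, float)) and not isinstance(got, bool)
            if not is_number or abs(got - want) > tol:
                return False
        elif got != want:
            return False
    return True


# Usage with a made-up task: the agent was asked to write report.json.
with tempfile.TemporaryDirectory() as workdir:
    report = Path(workdir) / "report.json"
    report.write_text(json.dumps({"total_rows": 120, "mean_error": 0.25}))
    print(grade_outcome(report, {"total_rows": 120, "mean_error": 0.25}))
```

Because the check is a pure function of the output file, rerunning it on the same artifact gives the same verdict, which is the property the post describes as "fully reproducible".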

Yiyou Sun@YiyouSun

6/ Come test your agents on ALE →
Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

Yiyou Sun@YiyouSun

2/ Which agent is leading?

Current podium (harness + flagship model):

🥇 Codex (gpt-5-5)
🥈 Cursor (composer-2-5)
🥉 Claude Code (opus-4-8)

See more at https://agents-last-exam.org/leaderboard.

Beyond the leaderboard, where do top-performing agents perform differently? Full analysis coming soon at https://agents-last-exam.org/blog.

Yiyou Sun@YiyouSun

5/ What kind of agent are we focusing on? We equip the Generalist Computer-Use Agent (GCUA) with full access, GUI, and CLI. We don't constrain how the agent solves a task. Whatever a human could do on a computer, the agent is free to do: click, type, script, browse, automate.

It's judged on the result, not the method.

Yiyou Sun@YiyouSun

3/ How does ALE compare to existing agent benchmarks?

Today's agent benchmarks are getting saturated fast. ALE sits in a different corner of the map:

• 55 industry domains
• 1,500+ tasks
• Tasks spanning both GUI and CLI

Top-tier agents pass just 26% overall, and only 2.6% on the Last-Exam tier.

Only have a CLI agent? That's fine. We ship ALE-CLI, the terminal-only subset of ALE.
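For readers who want to reproduce headline numbers like an overall pass rate and a per-tier pass rate from raw pass/fail results, here is a minimal sketch. The tier labels and the toy results below are made up for illustration; they are not ALE data, and the real leaderboard may aggregate differently.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute (overall_rate, per_tier_rates) from (tier, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    per_tier = {tier: passes[tier] / totals[tier] for tier in totals}
    n = sum(totals.values())
    overall = sum(passes.values()) / n if n else 0.0
    return overall, per_tier


# Toy results only: the tier names "standard" and "last_exam" are hypothetical.
toy_results = [
    ("standard", True),
    ("standard", False),
    ("standard", True),
    ("last_exam", False),
    ("last_exam", False),
]
overall, per_tier = pass_rates(toy_results)
print(f"overall: {overall:.1%}")
for tier, rate in sorted(per_tier.items()):
    print(f"{tier}: {rate:.1%}")
```

The overall rate is task-weighted, so a large easy tier can pull it well above the rate on the hardest tier, which is why the two figures are worth reporting separately.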

Yiyou Sun@YiyouSun

4/ Why do we call it "Last Exam"?

Because the day agents saturate ALE is the day they can actually power real industries.

That day is not today. But it's the one worth measuring and optimizing toward.

Yiyou Sun@YiyouSun

7/ We are proud to have a distinguished advisory committee @yannakakis, @gallantlab, @thg_lab, @yaminirangan, Tapio Schneider, Laure Zanna, @Idasim, Arvind Rao, @brad_rothenberg, @kaanozbay, Tarek Zohdi, Georgios Yannakakis, Carl Boettiger, @ksteinfe, Patrick Bryant guiding our industry landscape and task collection, and are deeply grateful for the compute and API support from @BerkeleyRDI, RDI Foundation @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program), as well as the dedication of our amazing organizing and execution team, along with all other data contributors to the ALE benchmark. This would not have happened without you!

Martin Kemka@mkemka_

@YiyouSun Good timing with fable releasing

Zengyi Qin@qinzytech

@YiyouSun huge step towards generalist agentic eval!

Lucas@lu_shuo_

@YiyouSun great work!!!

Yiyou Sun@YiyouSun

@mkemka_ 😵 Testing it right now.

Sean Wu@sean_n_wu

@lupantech Great work Pan!

Aaliya@aaliya_va

@YiyouSun Real job tasks are much harder than simple tests.

Suresh@_Suresh2

@SnorkelAI @amanda_dsouza @vincentsunnchen 2.6% on hardest, but we once got a 4% bump from test cases leaking into the prompt; code-graded evals are tricky
