“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇
Dawn Song releases Agents' Last Exam, a benchmark where top AI agents score just 2.6% on the hardest professional tasks
Story Overview
Dawn Song and a large team introduce Agents' Last Exam, a living benchmark built from more than 1,500 real professional tasks spanning 55 subdomains. Agents receive full GUI and CLI access on actual machines, and scoring relies on verifiable outputs rather than human judgment. On the hardest tier, even leading models clear just 2.6 percent of the items, highlighting the distance between current systems and economically valuable work.
Why prior benchmarks miss the mark
Existing tests cover far fewer domains and rarely require long-horizon execution inside real tools. ALE draws tasks directly from shipped industry projects and aligns them to federal occupational data, exposing gaps that saturated leaderboards leave hidden.
What the 2.6 percent figure leaves open
The benchmark is designed to grow, with a target of 5,000 tasks and ongoing contributions from experts. It remains unclear how quickly agent performance will rise or whether the hardest tier will stay a reliable signal once models train directly against it.
Many users praised the Agents' Last Exam benchmark as a pivotal or huge step for evaluating real-world AI agent performance on jobs, while some noted that it fails to simulate production system messiness and human judgment.
Most Activity
I vividly remember the excitement when I first heard about this. 📑Agents' Last Exam📑 is finally out:
- 55 industries, including manufacturing and architecture
- 1500+ tasks with expert workflows
- 2.6% pass rate on the hardest tier
Still a long way to go on agents, but now we have the right exam! Amazing effort by @YiyouSun and team!
My group & collaborators have built many of the benchmarks the field now runs on, including MMLU, MATH, CyberGym, and ExploitGym. I'm really excited to share our latest: Agents' Last Exam (ALE).
Why "Last Exam"? The name has two meanings: "Last" as the bar to clear — passing these exams means an agent can actually do the job and continue to deliver economically-valuable work in that profession.
"Last" as the frontier of difficulty — tasks are real, complex, long-horizon, and require professional expertise to execute. ALE sits right at the edge of what today's agents can reliably accomplish.
A few things that make ALE different:
• Real work, not vibes. Every one of the 1,500+ tasks comes from real projects or research contributed by domain experts. We converted them into verifiable tests and objectively graded evaluations — no human judges required.
• Built for breadth. ALE spans 55 non-physical occupations based on the O*NET / SOC 2018 occupational taxonomy, with contributions from 300+ experts across 100+ institutions.
• Judged on results, no restriction on process. We evaluate Generalist Computer-Use Agents (GCUAs) with full GUI + CLI access, allowing them to solve tasks however they choose: clicking, typing, scripting, browsing, and more. We just grade the outcome.
Huge thanks to my postdoc @YiyouSun for spearheading this tremendous effort, and to our esteemed advisory committee, incredible team and collaborators who made it possible.
We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵👇
We're proud Snorkel AI is part of Agents' Last Exam, with our researchers @amanda_dsouza and @vincentsunnchen among the co-authors and support from our Open Benchmarks Grants initiative.
The forecast: agents will do almost every job by 2027. The result on real, code-graded work? Top agents pass just 2.6% on the hardest tier.
Excited to keep pushing this forward with @YiyouSun, @dawnsongtweets and the @BerkeleyRDI team. 👇
Excited to see Agents' Last Exam (ALE) out! https://agents-last-exam.org/
As AI agents move toward real-world work, we need rigorous benchmarks to measure their capabilities, limitations, and broader societal and labor-market impact.
ALE is an important step toward grounding the discussion in realistic, code-graded, labor-market-aligned tasks. 👇
#ALE #AgentsLastExam #Agents
My group & collaborators have built many of the benchmarks the field now runs on, including MMLU, MATH, CyberGym, and ExploitGym. I'm really excited to share our latest: Agents' Last Exam (ALE).
Why "Last Exam"? The name has two meanings. "Last" as the bar you have to clear: passing these exams means an agent can actually do the job and keep doing valuable work in that profession.
"Last" as the difficulty of the tasks: they are real, long, and need professional knowledge to execute. ALE sits right at the boundary of what today's agents can reliably accomplish.
A few things that make ALE different:
• Real work, not vibes. Every one of the 1,500+ tasks was sourced from human experts’ past projects or research. We turned them into verifiable tests, scored deterministically. No human judges.
• Built for breadth. 55 non-physical industries, grounded in the O*NET / SOC 2018 federal occupational taxonomy, assembled by 300+ experts across 100+ institutions.
• Judged on results, not method. We give a Generalist Computer-Use Agent (GCUA) full GUI + CLI access and let it solve tasks however it chooses: click, type, script, browse. We just grade the outcome.
Huge thanks to my postdoc @YiyouSun for leading this massive effort, and our amazing team! The dataset and leaderboard are open. 🧵👇
“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

1/ Where do the tasks come from?
Every task is a real project that a human expert has already shipped, turned into a code-graded test.
No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).
Built by 300+ experts across 100+ institutions.
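To make "code-graded" concrete, here is a minimal sketch of an outcome-only check. Everything in it is an assumption for illustration: the task (produce a `report.csv` whose `total` column sums to 100), the file name, and the `grade_outcome` function are hypothetical and are not taken from the ALE codebase. The point is only that the grader inspects the final artifact and never looks at how the agent produced it.

```python
import csv
import tempfile
from pathlib import Path


def grade_outcome(workdir: Path) -> bool:
    """Hypothetical check: pass iff report.csv exists and its 'total' column sums to 100."""
    report = workdir / "report.csv"
    if not report.is_file():
        return False
    try:
        with report.open(newline="") as f:
            rows = list(csv.DictReader(f))
        return len(rows) > 0 and sum(int(r["total"]) for r in rows) == 100
    except (KeyError, ValueError, TypeError):
        # Missing column or non-integer cell counts as a failed deliverable.
        return False


def demo() -> tuple[bool, bool]:
    """Grade an empty working directory, then the same directory with a valid deliverable."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        before = grade_outcome(workdir)
        (workdir / "report.csv").write_text("item,total\na,40\nb,60\n")
        after = grade_outcome(workdir)
    return before, after
```

Because the check is deterministic, running it twice on the same working directory gives the same verdict, which is what makes a result reproducible without a human judge.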

6/ Come test your agents on ALE →
Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

2/ Which agent is leading?
Current podium (harness + flagship model):
🥇 Codex (gpt-5-5)
🥈 Cursor (composer-2-5)
🥉 Claude Code (opus-4-8)
See more at https://agents-last-exam.org/leaderboard.
Beyond the leaderboard, where do top-performing agents perform differently? Full analysis coming soon at https://agents-last-exam.org/blog.

5/ What kind of agent are we focusing on? We give the Generalist Computer-Use Agent (GCUA) full GUI and CLI access. We don't constrain how the agent solves a task. Whatever a human could do on a computer, the agent is free to do: click, type, script, browse, automate.
It's judged on the result, not the method.

3/ How does ALE compare to existing agent benchmarks?
Today's agent benchmarks are getting saturated fast. ALE sits in a different corner of the map:
• 55 industry domains
• 1,500+ tasks
• Tasks spanning both GUI and CLI
Top-tier agents pass just 26% overall, and only 2.6% on the Last-Exam tier.
Only have a CLI agent? That's fine. We ship ALE-CLI, the terminal-only subset of ALE.
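For readers who want to see how an overall figure and a per-tier figure can differ, here is a rough sketch of the arithmetic. The `pass_rates` helper, the "standard" tier name, and the four toy results are made up for illustration; they are not ALE data and not the benchmark's scoring code.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (tier, passed) pairs. Returns {tier: fraction passed} plus 'overall'."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [passed, total]
    for tier, passed in results:
        counts[tier][0] += int(passed)
        counts[tier][1] += 1
    rates = {tier: p / n for tier, (p, n) in counts.items()}
    total_passed = sum(p for p, _ in counts.values())
    total = sum(n for _, n in counts.values())
    rates["overall"] = total_passed / total if total else 0.0
    return rates


# Toy, invented outcomes: one pass out of four tasks, none on the hardest tier.
sample = [
    ("standard", True),
    ("standard", False),
    ("last-exam", False),
    ("last-exam", False),
]
rates = pass_rates(sample)
```

The overall rate is a task-weighted average across tiers, so a low score on a hard tier can sit alongside a much higher headline number.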

4/ Why do we call it "Last Exam"?
Because the day agents saturate ALE is the day they can actually power real industries.
That day is not today. But it's the one worth measuring and optimizing toward.

7/ We are proud to have a distinguished advisory committee guiding our industry landscape and task collection: @yannakakis, @gallantlab, @thg_lab, @yaminirangan, Tapio Schneider, Laure Zanna, @Idasim, Arvind Rao, @brad_rothenberg, @kaanozbay, Tarek Zohdi, Georgios Yannakakis, Carl Boettiger, @ksteinfe, Patrick Bryant. We are deeply grateful for the compute and API support from @BerkeleyRDI, RDI Foundation @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program), for the dedication of our amazing organizing and execution team, and to all other data contributors to the ALE benchmark. This would not have happened without you!

@YiyouSun Good timing with fable releasing

@YiyouSun huge step towards generalist agentic eval!

@YiyouSun great work!!!

@mkemka_ 😵 Testing it right now.

@lupantech Great work Pan!

@YiyouSun Real job tasks are much harder than simple tests.

@SnorkelAI @amanda_dsouza @vincentsunnchen 2.6% on the hardest tier, but we once got a 4% bump from test cases leaking into the prompt. Code-graded evals are tricky.

@YiyouSun the real unlock isn't making ai better at tasks, it's defining what tasks matter in the first place