Tech · 3h ago

Dawn Song and Yiyou Sun launch Agents' Last Exam, finding top AI agents score just 2.6% on hard professional tasks

The dataset contains more than 1,500 code-graded tasks across 55 industry domains.

Original post

Alex Ratner#1470

Snorkel AI@SnorkelAI

We're proud Snorkel AI is part of Agents' Last Exam, with our researchers @amanda_dsouza and @vincentsunnchen among the co-authors and support from our Open Benchmarks Grants initiative.

The forecast: agents will do almost every job by 2027. The result on real, code-graded work? Top agents pass just 2.6% on the hardest tier.

Excited to keep pushing this forward with @YiyouSun, @dawnsongtweets and the @BerkeleyRDI team. 👇

Yiyou Sun@YiyouSun

“AI agents will outperform humans at almost all jobs by 2026–2027.” - The forecast is everywhere. So we built the exam to test that claim, on real labor-market aligned work. On the hardest tier, top agents pass 2.6%. Meet Agents' Last Exam (ALE), a rolling benchmark measuring whether agents can actually do real jobs. 🧵👇

10:15 AM · Jun 9, 2026 · 4.6K Views

Sentiment

Users are praising the authors of the Agents' Last Exam benchmark for their great work revealing AI agents' low 2.6% pass rate on the hardest tasks.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 520 · Bookmarks: 1 · Likes: 8 · Retweets: 2

Dawn Song@dawnsongtweets

My group & collaborators have built many of the benchmarks the field now runs on: MMLU, MATH, CyberGym, ExploitGym, etc. I'm really excited to share our latest: Agents' Last Exam (ALE).

Why "Last Exam"? The name has two meanings.

"Last" as the bar you have to clear: passing these exams means an agent can actually do the job and keep doing valuable work in that profession.

"Last" as the difficulty of the tasks: they are real, long, and need professional knowledge to execute, so ALE sits right at the boundary of what today's agents can reliably accomplish.

A few things that make ALE different:

• Real work, not vibes. Every one of the 1,500+ tasks was sourced from human experts’ past projects or research. We turned them into verifiable tests, scored deterministically. No human judges.

• Built for breadth. 55 non-physical industries, grounded in the O*NET / SOC 2018 federal occupational taxonomy, assembled by 300+ experts across 100+ institutions.

• Judged on results, not method. We give a Generalist Computer-Use Agent (GCUA) full GUI + CLI access and let it solve tasks however it would — click, type, script, browse. We just grade the outcome.

Huge thanks to my postdoc @YiyouSun for leading this massive effort, and our amazing team! The dataset and leaderboard are open. 🧵👇

Yiyou Sun@YiyouSun

2h

Replies: 1

Yiyou Sun@YiyouSun

6/ Come test your agents on ALE →

Website: http://agents-last-exam.org
Task Samples: http://agents-last-exam.org/demo
Paper: https://arxiv.org/abs/2606.05405
HuggingFace: https://huggingface.co/datasets/agents-last-exam/agents-last-exam
Code: http://github.com/rdi-berkeley/agents-last-exam

3h

Yiyou Sun@YiyouSun

1/ Where do the tasks come from?

Every task is a real project that a human expert has already shipped, turned into a code-graded test.

No vibes, no human judge, fully reproducible. Spanning 55 non-physical industries, grounded in O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Built by 300+ experts across 100+ institutions.

3h

Yiyou Sun@YiyouSun

2/ Which agent is leading?

Current podium (harness + flagship model):

🥇 Codex (gpt-5-5)
🥈 Cursor (composer-2-5)
🥉 Claude Code (opus-4-8)

See more at https://agents-last-exam.org/leaderboard.

Beyond the leaderboard, where do top-performing agents perform differently? Full analysis coming soon at https://agents-last-exam.org/blog.

3h

Yiyou Sun@YiyouSun

5/ What kind of agent are we focusing on? We equip the Generalist Computer-Use Agent (GCUA) with full GUI and CLI access. We don't constrain how the agent solves a task. Whatever a human could do on a computer, the agent is free to do: click, type, script, browse, automate.

It's judged on the result, not the method.

3h
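The "judged on the result, not the method" idea in the post above can be illustrated with a minimal sketch. This is not ALE's actual harness: the function `grade_outcome`, the file name `report.csv`, the `total` column, and the expected sum of 60 are all hypothetical, chosen only to show a deterministic check that inspects the final artifact and ignores how the agent produced it.

```python
import csv
import tempfile
from pathlib import Path


def grade_outcome(workdir: Path) -> bool:
    """Hypothetical checker: pass only if report.csv exists and its
    'total' column sums to 60. How the file was produced is never inspected."""
    report = workdir / "report.csv"
    if not report.exists():
        return False
    with report.open(newline="") as f:
        rows = list(csv.DictReader(f))
    try:
        return sum(int(row["total"]) for row in rows) == 60
    except (KeyError, TypeError, ValueError):
        # Missing column or non-numeric cell counts as a failed task.
        return False


def demo() -> tuple[bool, bool]:
    """Grade an empty workspace, then grade it again after an artifact appears."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        before = grade_outcome(workdir)
        # Stand-in for whatever the agent did (GUI clicks, a script, etc.).
        (workdir / "report.csv").write_text(
            "region,total\nnorth,10\nsouth,20\neast,30\n"
        )
        after = grade_outcome(workdir)
    return before, after
```

Because the check depends only on the files left behind, running it twice on the same workspace gives the same verdict, which is the property the posts describe as "fully reproducible".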

Yiyou Sun@YiyouSun

3/ How does ALE compare to existing agent benchmarks?

Today's agent benchmarks are getting saturated fast. ALE sits in a different corner of the map:

• 55 industry domains
• 1,500+ tasks
• Tasks spanning both GUI and CLI

Top-tier agents pass just 26% overall, and only 2.6% on the Last-Exam tier.

Only have a CLI agent? That's fine. We ship ALE-CLI, the terminal-only subset of ALE.

3h
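For readers who want to see how an overall pass rate and a per-tier pass rate relate, and how a terminal-only subset could be carved out, here is a small sketch. Everything in it is an assumption for illustration: the `Result` record, the tier label `"standard"`, the `interface` field, and the four toy rows are invented and are not ALE data or ALE's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    task_id: str
    tier: str       # hypothetical tier label, e.g. "standard" or "last-exam"
    interface: str  # hypothetical field: "cli" or "gui"
    passed: bool


def pass_rates(results: list[Result]) -> dict[str, float]:
    """Return the pass rate overall and for each tier present in results."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        for key in ("overall", r.tier):
            totals[key] += 1
            passes[key] += int(r.passed)
    return {key: passes[key] / totals[key] for key in totals}


def cli_subset(results: list[Result]) -> list[Result]:
    """Keep only terminal-only tasks, mirroring the idea of a CLI subset."""
    return [r for r in results if r.interface == "cli"]


# Toy rows, invented for the example only.
TOY_RESULTS = [
    Result("t1", "standard", "cli", True),
    Result("t2", "standard", "gui", False),
    Result("t3", "last-exam", "cli", False),
    Result("t4", "last-exam", "gui", False),
]
```

Calling `pass_rates(TOY_RESULTS)` reports a rate for `"overall"` and one per tier, and `pass_rates(cli_subset(TOY_RESULTS))` does the same for the terminal-only rows. The point is only that a hard tier can sit far below the overall figure.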

Yiyou Sun@YiyouSun

4/ Why do we call it "Last Exam"?

Because the day agents saturate ALE is the day they can actually power real industries.

That day is not today. But it's the one worth measuring and optimizing toward.

3h

Yiyou Sun@YiyouSun

7/ We are proud to have a distinguished advisory committee guiding our industry landscape and task collection: @yannakakis, @gallantlab, @thg_lab, @yaminirangan, Tapio Schneider, Laure Zanna, @Idasim, Arvind Rao, @brad_rothenberg, @kaanozbay, Tarek Zohdi, Georgios Yannakakis, Carl Boettiger, @ksteinfe, Patrick Bryant. We are deeply grateful for the compute and API support from @BerkeleyRDI, RDI Foundation @ChenInstitute, @UniPat_AI, @SnorkelAI (Open Benchmarks Grants program), as well as the dedication of our amazing organizing and execution team, along with all other data contributors to the ALE benchmark. This would not have happened without you!

3h

Lucas@lu_shuo_

@YiyouSun great work!!!

3h