Agents' Last Exam Benchmark Tests AI Agents On 1,000+ Economic Tasks

Original post · elvis#483
DAIR.AI (@dair_ai)

// Agents' Last Exam //

Agents' Last Exam is a living benchmark of over 1,000 economically valuable tasks, built with 250+ industry experts and mapped to the U.S. federal occupational taxonomy.

The hardest tier sits at a 2.6% average full pass rate across mainstream harnesses and backbones.

ALE behaves like a GDP-coverage instrument instead of another test that saturates in a month.

Paper: https://arxiv.org/abs/2606.05405

Learn to build effective AI agents in our academy: https://academy.dair.ai/

8:18 AM · Jun 5, 2026 · 9.9K Views
Sentiment

Users praise the Agents' Last Exam benchmark for its insightful industry taxonomy and the sobering yet useful 2.6% pass rate as a data point on AI agent capabilities.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
Cluster Engagement
Posts from X
Christopher (@communicating)

@dair_ai I think that (2.6% pass rate) is a sobering but incredibly useful data point.

Of course the question is what the human pass rate is.

We know (ok I’m assuming knowing humans) it’s not 100% but it matters if it’s -20% or +80%.

20h · Views 18
Strata (@ChainZenit)

@dair_ai 2.6% pass rate just means we're still miles away from autonomous yield farmers.

20h · Views 14
Alex YGift (@Radipdegen)

@dair_ai kind of wild that only 2.6% pass the hardest tier

wonder how much of that is just orchestration failing

20h · Views 13
Rugbist (@rugbist_)

@dair_ai Tier list benchmarks are cute but 2.6% on real-world tasks feels more like a filter than a metric

what happens when models train on the test set?

20h · Views 12
Blissy (@BlissyOnX)

@dair_ai industry mapping is the interesting part here. the taxonomy approach might actually filter out lab-designed gotchas.

20h · Views 11
Ferbin (@Ferbin08)

@dair_ai 2.6% makes sense - real work is genuinely hard. But how much is agent design versus the test being too strict? If the test is too rigid, agents optimize for passing it instead of solving actual problems.

20h · Views 9