40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs.
We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy.
Math and coding represent only 3.5% of US jobs.
40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs.
We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy.
Users in the replies dismiss EconEvals AI job disruption predictions as nonsensical because the benchmark focuses on coding tasks while ignoring 96.5% of actual jobs.
No Digg Deeper questions have been answered for this story yet.

We built benchmarks for 143 work activities based on real user queries. Each query is retrieved and classified from open-source datasets like WildChat.
40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs.
We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy.

Finally, because each time-savings estimate is whitebox, we inspect the factors behind the predictions by categorizing the simulation's justifications. Our exposure estimates surface specific real-world bottlenecks that limit AI-enabled time savings on a task-by-task basis.

Our benchmarks span every non-military major occupational group in the US labor economy according to the Department of Labor's taxonomy. We include work activities across domains like management, finance, science, law, education, health care, food prep, engineering, and sales.

Read more: http://econevals.com Code: https://github.com/EconEvals/EconomicEvaluations Data: https://huggingface.co/datasets/EconEvals/EconEvalsJune2026 And, of course, this project couldn’t have been possible without my coauthors @HatgisKessell @t6aguirre @percyliang and @RishiBommasani !

However, building benchmarks from usage data only lets us measure tasks that people already use AI for. We want to evaluate AI on all use-cases.
We introduce a simulation-based synthetic data generation pipeline to cover (essentially) all U.S. work.

For each task and occupation, we create a worker persona and then have GPT-5-mini roleplay as a worker with this persona responding to an interviewer asking about time savings. Using this synthetic data, we evaluate models on 40 work activities and 43 occupations in GDPval.

How do AI capabilities impact the labor market?
Benchmarks compare models but do not answer this. Economists have developed the alternative lens of exposure that centers real-world impact: the amount of time AI saves workers.
We propose a new method of estimating exposure.

However, we also find that current usage (according to the Anthropic Economic Index) lags behind theoretical exposure across all groups of occupations. In other words, AI could save workers substantially more time on tasks.

We introduce the first whitebox exposure method. An LM roleplays a "worker," decomposes each task into steps, and estimates the time-savings for each step given current LM capabilities. Like prior methods it gives a number, but now you can inspect the reasoning trace.

We apply this method to estimate AI-enabled time-savings across 18,796 O*NET tasks. We find that 47% of occupations can save substantial time for at least half of their tasks.

Existing exposure estimates collapse this into a single prompt: ask an LM or person to provide an estimate of time-savings based on a short description of the task and a list of capabilities. However, this hides the justification and can't surface new capabilities or bottlenecks.

@alexwan55 We optimize for coding benchmarks but ignore 96.5% of actual jobs. No wonder these disruption predictions make no sense.