
Stanford's Rishi Bommasani and Percy Liang launch EconEvals to benchmark AI impact across the US labor market

Occupations related to math and coding account for only 3.5% of US jobs, yet they draw 40% of benchmarking effort.

Original post

Alexander Wan @alexwan55

40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs.

We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy.

2:20 PM · Jun 24, 2026 · 4.5K Views

Sentiment

The one reply with a sentiment score argues that AI job-disruption predictions make no sense because the field optimizes for coding benchmarks while ignoring 96.5% of actual jobs.

Positive: 0.0%. Negative: 100.0%. Based on 1 comment with sentiment.

Posts from X

Alexander Wan @alexwan55

We built benchmarks for 143 work activities based on real user queries. Each query is retrieved and classified from open-source datasets like WildChat.

Alexander Wan @alexwan55

Finally, because each time-savings estimate is whitebox, we inspect the factors behind the predictions by categorizing the simulation's justifications. Our exposure estimates surface specific real-world bottlenecks that limit AI-enabled time savings on a task-by-task basis.
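As a rough illustration of what categorizing justifications could look like, the sketch below tallies free-text justifications by bottleneck type. The thread does not say how the categorization is done, so the keyword matching, the category names, and the example justifications here are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

# Hypothetical categories and keywords, for illustration only.
BOTTLENECK_KEYWORDS = {
    "physical presence": ("in person", "on site", "physical"),
    "system access": ("access", "credentials", "internal system"),
    "human judgment": ("judgment", "approval", "sign-off"),
}


def categorize(justification):
    """Return the first category whose keyword appears in the text."""
    text = justification.lower()
    for category, keywords in BOTTLENECK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def tally_bottlenecks(justifications):
    """Count how often each bottleneck category appears."""
    return Counter(categorize(j) for j in justifications)


# Made-up justifications standing in for the simulation's reasoning traces.
justifications = [
    "Requires physical presence at the client site",
    "The LM has no access to the internal system",
    "Final approval needs a supervisor",
    "Formatting is already automated",
]
counts = tally_bottlenecks(justifications)
```

A tally like this only works because each estimate carries its own justification; a single-number estimate would leave nothing to categorize.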

Alexander Wan @alexwan55

Our benchmarks span every non-military major occupational group in the US labor economy according to the Department of Labor's taxonomy. We include work activities across domains like management, finance, science, law, education, health care, food prep, engineering, and sales.

Alexander Wan @alexwan55

Read more: http://econevals.com
Code: https://github.com/EconEvals/EconomicEvaluations
Data: https://huggingface.co/datasets/EconEvals/EconEvalsJune2026

And, of course, this project couldn’t have been possible without my coauthors @HatgisKessell @t6aguirre @percyliang and @RishiBommasani!

Alexander Wan @alexwan55

However, building benchmarks from usage data only lets us measure tasks that people already use AI for. We want to evaluate AI on all use-cases.

We introduce a simulation-based synthetic data generation pipeline to cover (essentially) all U.S. work.

Alexander Wan @alexwan55

For each task and occupation, we create a worker persona and then have GPT-5-mini roleplay as a worker with this persona responding to an interviewer asking about time savings. Using this synthetic data, we evaluate models on 40 work activities and 43 occupations in GDPval.

Alexander Wan @alexwan55

How do AI capabilities impact the labor market?

Benchmarks compare models but do not answer this. Economists have developed the alternative lens of exposure that centers real-world impact: the amount of time AI saves workers.

We propose a new method of estimating exposure.

Alexander Wan @alexwan55

However, we also find that current usage (according to the Anthropic Economic Index) lags behind theoretical exposure across all groups of occupations. In other words, AI could save workers substantially more time on tasks.

Alexander Wan @alexwan55

We introduce the first whitebox exposure method. An LM roleplays a "worker," decomposes each task into steps, and estimates the time-savings for each step given current LM capabilities. Like prior methods it gives a number, but now you can inspect the reasoning trace.
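To make the step-level idea concrete, here is a minimal sketch of a task decomposed into steps, each with its own time-savings estimate and justification. The thread does not state how step estimates are combined into a task-level number, so the time-weighted average below is an assumption, and the `Step` fields and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    baseline_minutes: float  # hypothetical: time the step takes without AI
    saved_fraction: float    # hypothetical: share of that time an LM saves, in [0, 1]
    justification: str       # free-text reason, kept so the estimate can be inspected


def task_time_savings(steps):
    """Time-weighted fraction of the task's baseline time that is saved.

    Assumption: steps are combined by weighting each step's saved
    fraction by its baseline duration.
    """
    total = sum(s.baseline_minutes for s in steps)
    if total == 0:
        return 0.0
    saved = sum(s.baseline_minutes * s.saved_fraction for s in steps)
    return saved / total


# Made-up decomposition of one task.
steps = [
    Step("Draft summary of client notes", 30, 0.5, "LM drafts; worker reviews"),
    Step("Meet client in person", 60, 0.0, "requires physical presence"),
    Step("File report in internal system", 10, 0.2, "LM lacks system access"),
]
savings = task_time_savings(steps)  # (15 + 0 + 2) / 100 minutes saved
```

Because each step keeps its justification, a low task-level number can be traced back to the specific steps that limit it, here the in-person meeting.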

Alexander Wan @alexwan55

We apply this method to estimate AI-enabled time-savings across 18,796 O*NET tasks. We find that 47% of occupations can save substantial time for at least half of their tasks.
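A statistic of this shape can be computed from per-task estimates as sketched below. The thread does not define "substantial", so the 0.5 cutoff is a placeholder assumption, and the occupation names and savings values are made up; only the "at least half of their tasks" rule comes from the post.

```python
def share_of_occupations(task_savings_by_occupation, substantial=0.5):
    """Share of occupations where at least half of the tasks reach the cutoff.

    `substantial` is a hypothetical threshold on a task's saved fraction.
    """
    if not task_savings_by_occupation:
        return 0.0
    qualifying = 0
    for savings in task_savings_by_occupation.values():
        n_substantial = sum(1 for s in savings if s >= substantial)
        # "At least half" is checked without division to avoid rounding issues.
        if savings and 2 * n_substantial >= len(savings):
            qualifying += 1
    return qualifying / len(task_savings_by_occupation)


# Made-up per-task savings for four hypothetical occupations.
example = {
    "A": [0.6, 0.7, 0.1],  # 2 of 3 tasks qualify
    "B": [0.1, 0.2],       # 0 of 2 tasks qualify
    "C": [0.5, 0.0],       # 1 of 2 tasks qualify (exactly half)
    "D": [0.9],            # 1 of 1 task qualifies
}
share = share_of_occupations(example)
```

Note that the result depends heavily on the cutoff: raising `substantial` can only lower the share.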

Alexander Wan @alexwan55

Existing exposure estimates collapse this into a single prompt: ask an LM or person to provide an estimate of time-savings based on a short description of the task and a list of capabilities. However, this hides the justification and can't surface new capabilities or bottlenecks.

High Jack @jackadoresai

@alexwan55 We optimize for coding benchmarks but ignore 96.5% of actual jobs. No wonder these disruption predictions make no sense.