Benchmark Shows Best AI Model Scores Only 62% on SNAP Navigation

Vals AI (@ValsAI)

AI is creating problems it still can’t solve.

The same technology poised to automate millions of jobs still can’t reliably help people navigate SNAP — the food assistance program 40 million Americans depend on. We built the first benchmark to measure that.

Partnering with Center for Civic Futures and @codeforamerica, we scored models on SNAP question scenarios that users would have to navigate, with expected-response rubrics validated by policy experts to match practice considerations.

The best model only scored 62%. Models handle federal questions like appeals and recertification reasonably well, but fall short on state-specific ones like replacing an EBT card. Benefits are administered by the states, meaning models are weakest where people need them most.

The same technology poised to automate millions of jobs should at least help strengthen the social safety net for the people it could displace. As AI adoption proliferates into public services, governments need a reliable way to test these tools before deploying them.

8:58 AM · Jun 9, 2026 · 15.9K Views
rishi (@RishiBommasani)

Very good to see work that provides a clearer throughline from AI capabilities to public benefits.

Evals in this area are scarce, but the returns could be high.


1d · Views 667 · Likes 3 · Bookmarks 2