Benchmark Shows AI Models Score Only 62% on SNAP Navigation

Original post

Vals AI (@ValsAI)

AI is creating problems it still can’t solve.

The same technology poised to automate millions of jobs still can’t reliably help people navigate SNAP — the food assistance program 40 million Americans depend on. We built the first benchmark to measure that.

Partnering with Center for Civic Futures and @codeforamerica, we scored models on SNAP question scenarios which users would have to navigate, with expected response rubrics validated by policy experts to match practice considerations.

The best model only scored 62%. Models handle federal questions like appeals and recertification reasonably well, but fall short on state-specific ones like replacing an EBT card. Benefits are administered by the states, meaning models are weakest where people need them most.

The same technology poised to automate millions of jobs should at least help strengthen the social safety net for the people it could displace. As AI adoption proliferates into public services, governments need a reliable way to test these tools before deploying them.

8:58 AM · Jun 9, 2026 · 12.9K Views

Posts from X

rishi (@RishiBommasani)

Very good to see work that provides a clearer throughline from AI capabilities to public benefits.

Evals in this area are scarce but returns could be high
