AI is creating problems it still can’t solve.
The same technology poised to automate millions of jobs still can’t reliably help people navigate SNAP — the food assistance program 40 million Americans depend on. We built the first benchmark to measure that.
Partnering with the Center for Civic Futures and @codeforamerica, we scored models on realistic SNAP scenarios that users have to navigate, using expected-response rubrics validated by policy experts to reflect how the program works in practice.
The best model scored only 62%. Models handle federal questions like appeals and recertification reasonably well, but fall short on state-specific ones like replacing an EBT card. Because benefits are administered by the states, models are weakest where people need them most.
The same technology poised to automate millions of jobs should at least help strengthen the social safety net for the people it could displace. As AI adoption spreads into public services, governments need a reliable way to test these tools before deploying them.