actAVA releases CHI-Bench, a benchmark on which 30 frontier AI agents reach at most 28 percent success on 75 long-horizon U.S. healthcare tasks
Each task runs 60–80 agent steps across simulators of 21 healthcare apps, with a 1,279-document operations handbook for reference.
CHI-Bench is open under Apache 2.0; the leaderboard accepts community submissions today.
🤖 GitHub: https://github.com/actava-ai/chi-bench
🤗 HuggingFace: https://huggingface.co/datasets/actava/chi-bench
🏆 Leaderboard: https://actava.ai/benchmarks
📝 arXiv: https://arxiv.org/pdf/2605.16679
In real healthcare operations, agents must do far more than answer medical questions. They need to read charts, interpret clinical and operational policies, verify coverage, route referrals, draft peer-to-peer (P2P) scripts, and finalize care plans, and a single policy violation can mean a denied claim or a missed patient outcome.

@actAVAai @iscreamnearby led and developed CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon, policy-rich benchmark for AI agents operating across end-to-end U.S. healthcare workflows.

Key highlights:
▶️ High-fidelity simulators for Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, all exposed as MCP servers over patient, clinician, and insurer records.
🧪 Each trial runs 60–80 agent steps across 4–6 clinical stages, with access to 21 healthcare apps, 200+ MCP tools, and a 1,279-document operations handbook.

Leaderboard results across 30 frontier agents:
• Claude Code + Opus 4.6: 28% pass@1
• Codex + GPT-5.5: 21%
• Utilization review: 41%
• Care management: 32%
• Prior authorization: 29%

Reliability remains a major challenge: no agent exceeds 20% when the same case is repeated three times.
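
To make the gap between pass@1 and the repeated-case number concrete, here is a minimal sketch of how the two could be computed from per-trial outcomes. It assumes the repeated-case figure counts a task as solved only when all three trials succeed; the post does not spell out the exact definition, and the function names and toy data below are hypothetical, not taken from the CHI-Bench code.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean single-trial success rate over every task and trial."""
    trials = [ok for outcomes in results.values() for ok in outcomes]
    return sum(trials) / len(trials)


def all_trials_pass(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials all succeed.

    Assumption: this is one plausible reading of "repeated three times";
    tasks with fewer than k recorded trials are skipped.
    """
    eligible = [o for o in results.values() if len(o) >= k]
    return sum(all(o[:k]) for o in eligible) / len(eligible)


# Hypothetical toy outcomes: task id -> success flag for each of 3 trials.
toy = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, False, False],
}

print(pass_at_1(toy))        # 6 successes / 12 trials = 0.5
print(all_trials_pass(toy))  # only task_a passes all three = 0.25
```

In this toy data the agent succeeds on half of all individual trials but solves only one task in four consistently, which is the kind of drop the reliability number describes.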