4h ago

actAVA releases CHI-Bench benchmark where 30 frontier AI agents reach at most 28 percent success on 75 long-horizon U.S. healthcare tasks

Tasks run 60-80 steps across simulators of 21 apps and a 1,279-document handbook.


Original post

In real healthcare operations, agents must do far more than answer medical questions. They need to read charts, interpret clinical and operational policies, verify coverage, route referrals, draft P2P scripts, and finalize care plans, where a single policy violation can mean a denied claim or missed patient outcome.

@actAVAai @iscreamnearby led and developed CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon, policy-rich benchmark for AI agents operating across end-to-end U.S. healthcare workflows.

Key highlights:

▶️ High-fidelity simulators for Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, all exposed as MCP servers over patient, clinician, and insurer records.

🧪 Each trial runs 60–80 agent steps across 4–6 clinical stages, with access to 21 healthcare apps, 200+ MCP tools, and a 1,279-document operations handbook.

Leaderboard results across 30 frontier agents:

• Claude Code + Opus 4.6: 28% pass@1
• Codex + GPT-5.5: 21%
• Utilization review: 41%
• Care management: 32%
• Prior authorization: 29%

Reliability remains a major challenge: no agent exceeds 20% when the same case is repeated three times.

9:14 AM · May 20, 2026
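The post reports two kinds of numbers: pass@1 and a stricter figure for the same case repeated three times. A minimal Python sketch of how such metrics could be computed from per-trial results follows. The post does not spell out CHI-Bench's scoring, so the all-three-must-pass reading of the reliability figure is an assumption, and the function names, the `(case_id, passed)` record layout, and the sample data are all hypothetical.

```python
from collections import defaultdict


def pass_at_1(trials):
    """Fraction of individual trials that passed.

    `trials` is a list of (case_id, passed) pairs; this layout is a
    hypothetical stand-in, not the benchmark's real output format.
    """
    if not trials:
        raise ValueError("no trials given")
    return sum(1 for _, passed in trials if passed) / len(trials)


def all_pass_rate(trials, k=3):
    """Fraction of cases whose first k trials ALL passed.

    Assumption: "repeated three times" means a case counts only if every
    one of its k repeats succeeds. Cases with fewer than k trials are skipped.
    """
    by_case = defaultdict(list)
    for case_id, passed in trials:
        by_case[case_id].append(passed)
    complete = [runs[:k] for runs in by_case.values() if len(runs) >= k]
    if not complete:
        raise ValueError("no case has at least k trials")
    return sum(1 for runs in complete if all(runs)) / len(complete)


# Made-up results for four cases, three repeats each.
trials = [
    ("case-a", True), ("case-a", True), ("case-a", True),
    ("case-b", True), ("case-b", False), ("case-b", True),
    ("case-c", False), ("case-c", False), ("case-c", False),
    ("case-d", True), ("case-d", False), ("case-d", False),
]

print(pass_at_1(trials))      # 6 of 12 trials passed
print(all_pass_rate(trials))  # only case-a passed all three repeats
```

In this made-up data half of the individual trials succeed, yet only one case in four succeeds on every repeat, which illustrates how a repeated-trial figure can sit well below pass@1.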

Reposted by

@SANMIKOYEJO

Caiming Xiong @CAIMINGXIONG

CHI-Bench is open under Apache 2.0; the leaderboard accepts community submissions today.

🤖 GitHub: https://github.com/actava-ai/chi-bench
🤗 Hugging Face: https://huggingface.co/datasets/actava/chi-bench
🏆 Leaderboard: https://actava.ai/benchmarks
📝 arXiv: https://arxiv.org/pdf/2605.16679

Caiming Xiong @CaimingXiong

4:14 PM · May 20, 2026 · 445 Views

4:14 PM · May 20, 2026 · 302 Views
