Mingqian Zheng releases CARRYONBENCH benchmark for LLM clarification
——0——
Mingqian Zheng released CARRYONBENCH, an interactive benchmark of 5,970 simulated conversations across 14 models. It measures whether large language models revise initial refusals of ambiguous but benign queries once users clarify intent. Single-turn fulfillment rates after clarification range from 10.5 percent to 37.6 percent. The evaluation surfaces recurring failure modes including utility lock-in that blocks recovery and unsafe revisions that weaken the original refusal.