AI evaluation is entering an interactive benchmark era.
Across tool-use agents, web/OS benchmarks, multi-agent systems, and reliability evaluations, interaction is becoming central to how modern AI systems are tested.
But the field risks adding interaction faster than it develops the scientific principles for evaluating it.
Our position:
Interactive evaluation is not just longer tasks, tool use, or multi-turn interaction.
It requires a design science for mapping trajectories to valid evaluative claims.
📄 https://arxiv.org/abs/2605.17829
💻 https://github.com/keyangds/interactive_evaluation