Very excited to share that our paper "Towards a Science of AI Agent Reliability" was accepted at ICML 2026! See you in Seoul! 🎉
We just released our camera-ready version with three important updates (details below). We also recorded a short video on the paper's contributions.
Main changes (full discussion at https://hal.cs.princeton.edu/reliability/#updates):
1️⃣ We have added the latest set of frontier models to our evaluation (GPT 5.5, Gemini 3.1 Pro and 3.5 Flash, and Claude Opus 4.7) and find that they are not meaningfully more reliable than previously released models. Agent reliability is still far from solved.
2️⃣ We have updated the definition and measurement of our outcome consistency metric, which contained a typo in the preprint we initially released. The typo caused us to underestimate outcome consistency in our initial set of results. We have updated the paper and our codebase to use the corrected metric. Even with the correction, our new results show that outcome consistency is still surprisingly low across many of the reported models.
3️⃣ We discovered multiple issues in the HAL Generalist Agent scaffold that we used for our experiments on GAIA. Notably, we found multiple instances of answer leakage and of agents cheating on our evaluation, which caused us to slightly overestimate both accuracy and reliability. At the same time, the scaffold was overly restrictive about which software libraries agents could import, which caused us to slightly underestimate both accuracy and reliability. We rigorously audited the scaffold and fixed these issues. Overall, the resulting accuracy and reliability numbers do not differ meaningfully from our original ones.
📄 Our paper: https://arxiv.org/abs/2602.16666
📊 Our dashboard: https://hal.cs.princeton.edu/reliability/
🎥 Short video: https://youtu.be/qftDfEft7U0
Joint work w/ @sayashk, @PKirgis, @khl53182440, @SaitejaUtpala, and @random_walker.