Reality is the only benchmark that actually matters. This is because it is grounded in objective truth and therefore cannot be overfit or gamed. Arena is built to measure the post-deployment characteristics of AI in the hands of real users.
Then, what should we measure? What if we went beyond human preference and started measuring everything?