Occasionally I write about robotics evaluation and how hard it is to tell which models are actually the best. Right now this is sort of "privileged information," known only to a select few, but hopefully one day we will be able to tell via common benchmarks (like Humanity's Last Exam and SWE-bench) or via platforms like Chatbot Arena.
But today is not that day. I wrote up a quick blog post on benchmarks in robotics, how they're currently saying different things, and what that might mean.