We need a Robotics MMLU for generalist benchmarking. Each of these benchmarks reveals some properties, but people can always model / inject data in a way that “wins” the artificial waypoints, while a simple disturbance will still make the policies fail.
The robolab leaderboard is interesting -- still fairly noisy (i.e. the rankings aren't the same as on other leaderboards like RoboArena or MolmoSpaces). Suggests we're pretty far from a truly general-purpose robotics model, IMO. The data a model is trained on is still a huge differentiator.


