Sorry, I don’t mean “I’m cheating” to imply I’m benchmaxxing or actually cheating.
I mean that if deciding what data goes into a model were as clear-cut as a single number that holds across all scales, etc., then the auto researchers would probably already outpace humans.
My point is that even defining “good” isn’t easy for what we want out of base/midtrained models. So we cannot produce an easy metric, and as such we cannot give an auto researcher an easy hill to climb.
It’s not that the work isn’t hill climbing; it’s that, without something to give you a dead reckoning, I don’t know that the auto researchers are really up to the task yet.