I think people are taking results from things with really easy and meaningful metrics like loss / perplexity and assuming it applies more broadly than it really does.
I promise I can out data the auto researchers right now. I’m cheating because it’s a messy objective, but that’s the point right?
I feel like manual hill-climbing research now would be like trying to write code to beat neural networks 10 years ago



