yes there's an actual useful capability gain
but these gains are narrower and not comparable to those of Anthropic or OpenAI models
people like to use individual benchmarks to show open models are as good as closed models, but it doesn't work that way
a score of 80% on SWE-Bench Verified is more meaningful for Anthropic models than it is for MiniMax models, for example
@scaling01 @cheatyyyy hillclimbing on a measurable proxy of downstream capability is legitimate, and it's what benchmarks originally existed for. I don't think they benchmaxed on DeepSWE as such, though