GLM-5.2 is great at design (Opus level IMO).
I am also starting to see great results with long-running tasks, too.
How is this possible?
I think there are a few clever hacks. But I just came across this from the official blog, and they actually trained this model with an anti-hacking module.
RL, as many know, comes with this issue of reward hacking that often enables the model to take weird and suboptimal shortcuts. Not only that, but it makes the models sometimes feel like it's sometimes "lazy" or just plain "dumb" at times, including other issues like intent misalignment, verbosity, sycophancy, deception, etc. And you really don't want that for long-running tasks operated by coding agents.
This is a great insight. If you use the standard /goal (in 5.5 or 4.8), you notice the models often take shortcuts that lead to long-running tasks (wasting tokens along the way) but with poor results. This is why I advocate for a focus on better verifiers.
So this anti-hacking idea is a model capability that should, in theory, lead to better results on long-horizon tasks.
I've seen efforts here and there in a few research papers, but haven't seen it translated to much, much less in a frontier, open-weight model.
This might be contributing to some of the great results we are seeing with GLM-5.2, but I suspect there is more, of course, like better verification capabilities. It's not clear how all of these training signals lead to downstream capabilities, but this is something to look at closely with newer models.