Looks like some folks tested GLM 5.2 on CTF challenges and found performance roughly on par with Opus 4.7. It's unclear whether this would generalize to the AISI evals. If the result holds up, I think dedicated RL/TTT on GLM could probably reach prompt-only Mythos levels within weeks to months.
It's a bit odd to me that a lot of the frontier cyber-risk evals out there don't report results for open models. I really want to know where GLM 5.2/Kimi 2.7 sit on this AISI eval. What's the true marginal cyber risk of Mythos/Fable/GPT5 over open models?