This is the beginning of serious autoresearch that can't be withdrawn with the push of a button by some responsible AI safety committee. No matter what happens next, we have early AI scientist assistants already. RSI can be local, if slow.
Here’s a fun comparison between GLM 5.2 and Opus 4.8 on a one-shot reproduction of the SDPO paper:
This is a hard task: the model must resolve messy verl issues and then run ablations to completion and confirm the paper’s claims.
- GLM 5.2 cost $6.21, while Opus 4.8 cost us $46.35.
- Both models spent the bulk of their tokens resolving initial verl issues. GLM 5.2 attempted 14 failed runs before its first success, while Opus 4.8 attempted 9 runs.
- Surprisingly, GLM 5.2 used 2.65M tokens (excl. re-reads), compared to 4.53M tokens for Opus 4.8.
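
For a rough sense of scale, the ratios implied by these numbers can be worked out in a few lines of Python. This is only a sketch built from the figures quoted above. The per-million-token number is a blended estimate, not a list price: the token counts exclude re-reads and the actual pricing is not stated here, so treat it as an approximation. The names `runs`, `ratio`, and `blended_usd_per_m_tokens` are made up for illustration.

```python
# Figures quoted in the post; nothing else is assumed.
runs = {
    "GLM 5.2": {"cost_usd": 6.21, "tokens_m": 2.65},
    "Opus 4.8": {"cost_usd": 46.35, "tokens_m": 4.53},
}


def ratio(metric: str, numerator: str = "Opus 4.8", denominator: str = "GLM 5.2") -> float:
    """How many times larger `metric` is for `numerator` than for `denominator`."""
    return runs[numerator][metric] / runs[denominator][metric]


def blended_usd_per_m_tokens(model: str) -> float:
    """Total cost divided by reported tokens (in millions).

    Rough estimate only: the reported token counts exclude re-reads.
    """
    return runs[model]["cost_usd"] / runs[model]["tokens_m"]


if __name__ == "__main__":
    print(f"cost ratio (Opus / GLM):  {ratio('cost_usd'):.2f}x")
    print(f"token ratio (Opus / GLM): {ratio('tokens_m'):.2f}x")
    for model in runs:
        print(f"{model}: ~${blended_usd_per_m_tokens(model):.2f} per 1M reported tokens")
```

On these figures, Opus 4.8 comes out at roughly 7.5x the cost for roughly 1.7x the reported tokens, so most of the cost gap comes from the price per token rather than from token volume.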

