We use a combination of Claude Code @AnthropicAI (Opus 4.8) and Codex @OpenAI (GPT 5.5). The ScaleAutoResearch pipeline is very similar to the one used for Ramsey numbers, but we replace the resources and domain-specific human intuitions with those for optimizer design. The method the agents found uses three simple tricks to cut the step count:
(1) Longer 2nd-moment memory for the 1-D "aux" params (RMSNorm gains, biases): Adam β₂ 0.99 → 0.997 (0.9965 for the attn-proj bias). Agents' reason: these 1-D params get no cross-coordinate averaging, so at β₂=0.99 their variance estimate is noisy and jitters the step size; a longer memory steadies it. (2875 → 2830 steps)
(2) SOAP on all hidden matrices, refreshed every step: MLP+V → +q/k/attn-proj, and precondition_frequency 10 → 1. Agents' reason: if SOAP curvature helps MLP/V, it should help the other hidden matrices too, and an every-step eigenbasis tracks the moving curvature better. Cost: ~29% more time/step. (2830 → 2800 steps)
(3) Shorter LR-cooldown horizon + momentum tuning, then prune lots of now-redundant components: Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn-SOAP denom-floor. Agents' reason: on the faster trajectory the LR should anneal sooner, and once SOAP covers every matrix the older geometry tricks are redundant, so pruning them also buys back ~19%/step. (2800 → 2755 steps)
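A rough way to see the reasoning behind (1): an EMA with decay β₂ averages over about 1/(1-β₂) steps, so 0.99, 0.9965 and 0.997 correspond to roughly 100, 286 and 333 steps of memory. The sketch below is only a toy illustration, not the agents' code: it assumes a single coordinate fed i.i.d. unit-variance Gaussian gradients, and the helper names `ema_horizon` and `second_moment_jitter` are made up for this example.

```python
import random
import statistics


def ema_horizon(beta2: float) -> float:
    """Approximate averaging window (in steps) of an EMA with decay beta2."""
    return 1.0 / (1.0 - beta2)


def second_moment_jitter(beta2: float, steps: int = 200_000, seed: int = 0) -> float:
    """Relative std of Adam's second-moment EMA for one coordinate.

    Toy assumption: gradients are i.i.d. N(0, 1), so the true second
    moment is 1 and any spread in v is pure estimator noise.
    """
    rng = random.Random(seed)
    v = 1.0  # start at the true second moment to skip warm-up
    samples = []
    for t in range(steps):
        g = rng.gauss(0.0, 1.0)
        v = beta2 * v + (1.0 - beta2) * g * g  # Adam second-moment update
        if t >= steps // 2:  # keep only the second half of the run
            samples.append(v)
    return statistics.pstdev(samples) / statistics.fmean(samples)


if __name__ == "__main__":
    for beta2 in (0.99, 0.9965, 0.997):
        print(
            f"beta2={beta2}: horizon ~{ema_horizon(beta2):.0f} steps, "
            f"relative jitter {second_moment_jitter(beta2):.3f}"
        )
```

In this toy setting the longer memory gives a visibly steadier estimate of the second moment, which is the effect the agents point to; real gradients are neither i.i.d. nor stationary, so the numbers are illustrative only.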
[3/n]